Predicting Crime Rates throughout the USA (part 1, Murder Analysis)

Felipe Reyes (Universidad Complutense de Madrid, https://ucm.es)
2024-08-31

This dataset, compiled by the FBI and obtained from the UC Irvine Machine Learning Repository, covers the crime rates of 2,215 communities and contains a total of 147 variables. The dataset, called Communities and Crime, holds law enforcement, socio-economic and administrative data about communities in the United States. The variables are very diverse, but were all recorded around 1995. There are multiple target variables, all of which are continuous rather than binary. It is therefore necessary for this analysis to create new dependent variables, obtained by splitting murdPerPop and ViolentCrimesPerPop at their medians. Two analyses will thus be conducted, one based on violence and the other on murder. This HTML document presents the analysis for the murder target variable. However, since both murdPerPop and ViolentCrimesPerPop come from the same dataset, the exploratory part is for the most part identical for both, except when examining how the target variables relate to the independent variables.

Import data set

rm(list = ls())

library("readxl")
# Local path to the Excel export of the Communities and Crime data
crime <- read_excel("/Users/felipereyes/iCloud Drive (Archiv)/Documents/UNI/complutense/TFM/DATA.CommViolPredUnnormalizedData 2.xlsx")

Packages
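The chunks below rely on several packages that are otherwise loaded piecemeal throughout the document; collecting them here avoids missing-function errors later (`glimpse()` comes from dplyr, the boxplots from ggplot2, and `multiplot()` is assumed to come from Rmisc, although it also circulates as a standalone helper from the R Cookbook):

```r
# Install any missing packages, then load everything used in this analysis.
pkgs <- c("readxl", "Rmisc", "dplyr", "tibble", "skimr", "ggplot2")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)

library(readxl)   # read_excel()
library(Rmisc)    # multiplot() -- assumption, see note above
library(dplyr)    # glimpse(), select(), mutate(), the %>% pipe
                  # (loaded after Rmisc so its verbs are not masked by plyr,
                  # which Rmisc attaches)
library(tibble)   # tibble()
library(skimr)    # skim()
library(ggplot2)  # boxplots
```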

Exploratory Analysis

First, it is crucial to get an overview of the dataset:

glimpse(crime)
Rows: 2,215
Columns: 147
$ communityname         <chr> "BerkeleyHeightstownship", "Marpletown…
$ state                 <chr> "NJ", "PA", "OR", "NY", "MN", "MO", "M…
$ countyCode            <chr> "39", "45", "?", "35", "7", "?", "21",…
$ communityCode         <chr> "5320", "47616", "?", "29443", "5068",…
$ fold                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ population            <dbl> 11980, 23123, 29344, 16656, 11245, 140…
$ householdsize         <chr> "3.1", "2.82", "2.43", "2.4", "2.76", …
$ racepctblack          <chr> "1.37", "0.8", "0.74", "1.7", "0.53", …
$ racePctWhite          <chr> "91.78", "95.57", "94.33", "97.35", "8…
$ racePctAsian          <chr> "6.5", "3.44", "3.43", "0.5", "1.17", …
$ racePctHisp           <chr> "1.88", "0.85", "2.35", "0.7", "0.52",…
$ agePct12t21           <chr> "12.47", "11.01", "11.36", "12.55", "2…
$ agePct12t29           <chr> "21.44", "21.3", "25.88", "25.2", "40.…
$ agePct16t24           <chr> "10.93", "10.48", "11.01", "12.19", "2…
$ agePct65up            <chr> "11.33", "17.18", "10.28", "17.57", "1…
$ numbUrban             <dbl> 11980, 23123, 29344, 0, 0, 140494, 287…
$ pctUrban              <chr> "100", "100", "100", "0", "0", "100", …
$ medIncome             <dbl> 75122, 47917, 35669, 20580, 17390, 215…
$ pctWWage              <chr> "89.24", "78.99", "82", "68.15", "69.3…
$ pctWFarmSelf          <chr> "1.55", "1.11", "1.15", "0.24", "0.55"…
$ pctWInvInc            <chr> "70.2", "64.11", "55.73", "38.95", "42…
$ pctWSocSec            <chr> "23.62", "35.5", "22.25", "39.48", "32…
$ pctWPubAsst           <chr> "1.03", "2.75", "2.94", "11.71", "11.2…
$ pctWRetire            <chr> "18.39", "22.85", "14.56", "18.33", "1…
$ medFamInc             <dbl> 79584, 55323, 42112, 26501, 24018, 277…
$ perCapInc             <dbl> 29711, 20148, 16946, 10810, 8483, 1187…
$ whitePerCap           <dbl> 30233, 20191, 17103, 10909, 9009, 1202…
$ blackPerCap           <dbl> 13600, 18137, 16644, 9984, 887, 7382, …
$ indianPerCap          <dbl> 5725, 0, 21606, 4941, 4425, 10264, 214…
$ AsianPerCap           <dbl> 27101, 20074, 15528, 3541, 3352, 10753…
$ OtherPerCap           <dbl> 5115, 5250, 5954, 2451, 3000, 7192, 21…
$ HispPerCap            <dbl> 22838, 12222, 8405, 4391, 1328, 8104, …
$ NumUnderPov           <dbl> 227, 885, 1389, 2831, 2855, 23223, 112…
$ PctPopUnderPov        <chr> "1.96", "3.98", "4.75", "17.23", "29.9…
$ PctLess9thGrade       <chr> "5.81", "5.61", "2.8", "11.05", "12.15…
$ PctNotHSGrad          <chr> "9.9", "13.72", "9.09", "33.68", "23.0…
$ PctBSorMore           <chr> "48.18", "29.89", "30.13", "10.81", "2…
$ PctUnemployed         <chr> "2.7", "2.43", "4.01", "9.86", "9.08",…
$ PctEmploy             <chr> "64.55", "61.96", "69.8", "54.74", "52…
$ PctEmplManu           <chr> "14.65", "12.26", "15.95", "31.22", "6…
$ PctEmplProfServ       <chr> "28.82", "29.28", "21.52", "27.43", "3…
$ PctOccupManu          <chr> "5.49", "6.39", "8.79", "26.76", "10.9…
$ PctOccupMgmtProf      <chr> "50.73", "37.64", "32.48", "22.71", "2…
$ MalePctDivorce        <chr> "3.67", "4.23", "10.1", "10.98", "7.51…
$ MalePctNevMarr        <chr> "26.38", "27.99", "25.78", "28.15", "5…
$ FemalePctDiv          <chr> "5.22", "6.45", "14.76", "14.47", "11.…
$ TotalPctDiv           <chr> "4.47", "5.42", "12.55", "12.91", "9.7…
$ PersPerFam            <chr> "3.22", "3.11", "2.95", "2.98", "2.98"…
$ PctFam2Par            <chr> "91.43", "86.91", "78.54", "64.02", "5…
$ PctKids2Par           <chr> "90.17", "85.33", "78.85", "62.36", "5…
$ PctYoungKids2Par      <chr> "95.78", "96.82", "92.37", "65.38", "6…
$ PctTeen2Par           <chr> "95.81", "86.46", "75.72", "67.43", "7…
$ PctWorkMomYoungKids   <chr> "44.56", "51.14", "66.08", "59.59", "6…
$ PctWorkMom            <chr> "58.88", "62.43", "74.19", "70.27", "6…
$ NumKidsBornNeverMar   <dbl> 31, 43, 164, 561, 402, 1511, 263, 2368…
$ PctKidsBornNeverMar   <chr> "0.36", "0.24", "0.88", "3.84", "4.7",…
$ NumImmig              <dbl> 1277, 1920, 1468, 339, 196, 2091, 2637…
$ PctImmigRecent        <chr> "8.69", "5.21", "16.42", "13.86", "46.…
$ PctImmigRec5          <chr> "13", "8.65", "23.98", "13.86", "56.12…
$ PctImmigRec8          <chr> "20.99", "13.33", "32.08", "15.34", "6…
$ PctImmigRec10         <chr> "30.93", "22.5", "35.63", "15.34", "69…
$ PctRecentImmig        <chr> "0.93", "0.43", "0.82", "0.28", "0.82"…
$ PctRecImmig5          <chr> "1.39", "0.72", "1.2", "0.28", "0.98",…
$ PctRecImmig8          <chr> "2.24", "1.11", "1.61", "0.31", "1.18"…
$ PctRecImmig10         <chr> "3.3", "1.87", "1.78", "0.31", "1.22",…
$ PctSpeakEnglOnly      <chr> "85.68", "87.79", "93.11", "94.98", "9…
$ PctNotSpeakEnglWell   <chr> "1.37", "1.81", "1.14", "0.56", "0.39"…
$ PctLargHouseFam       <chr> "4.81", "4.25", "2.97", "3.93", "5.23"…
$ PctLargHouseOccup     <chr> "4.17", "3.34", "2.05", "2.56", "3.11"…
$ PersPerOccupHous      <chr> "2.99", "2.7", "2.42", "2.37", "2.35",…
$ PersPerOwnOccHous     <chr> "3", "2.83", "2.69", "2.51", "2.55", "…
$ PersPerRentOccHous    <chr> "2.84", "1.96", "2.06", "2.2", "2.12",…
$ PctPersOwnOccup       <chr> "91.46", "89.03", "64.18", "58.18", "5…
$ PctPersDenseHous      <chr> "0.39", "1.01", "2.03", "1.21", "2.94"…
$ PctHousLess3BR        <chr> "11.06", "23.6", "47.46", "45.66", "55…
$ MedNumBR              <dbl> 3, 3, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
$ HousVacant            <dbl> 64, 240, 544, 669, 333, 5119, 566, 205…
$ PctHousOccup          <chr> "98.37", "97.15", "95.68", "91.19", "9…
$ PctHousOwnOcc         <chr> "91.01", "84.88", "57.79", "54.89", "5…
$ PctVacantBoarded      <chr> "3.12", "0", "0.92", "2.54", "3.9", "2…
$ PctVacMore6Mos        <chr> "37.5", "18.33", "7.54", "57.85", "42.…
$ MedYrHousBuilt        <dbl> 1959, 1958, 1976, 1939, 1958, 1966, 19…
$ PctHousNoPhone        <chr> "0", "0.31", "1.55", "7", "7.45", "6.1…
$ PctWOFullPlumb        <chr> "0.28", "0.14", "0.12", "0.87", "0.82"…
$ OwnOccLowQuart        <dbl> 215900, 136300, 74700, 36400, 30600, 3…
$ OwnOccMedVal          <dbl> 262600, 164200, 90400, 49600, 43200, 5…
$ OwnOccHiQuart         <dbl> 326900, 199900, 112000, 66500, 59500, …
$ OwnOccQrange          <dbl> 111000, 63600, 37300, 30100, 28900, 35…
$ RentLowQ              <dbl> 685, 467, 370, 195, 202, 215, 463, 186…
$ RentMedian            <dbl> 1001, 560, 428, 250, 283, 280, 669, 25…
$ RentHighQ             <dbl> 1001, 672, 520, 309, 362, 349, 824, 32…
$ RentQrange            <dbl> 316, 205, 150, 114, 160, 134, 361, 139…
$ MedRent               <dbl> 1001, 627, 484, 333, 332, 340, 736, 33…
$ MedRentPctHousInc     <chr> "23.8", "27.6", "24.1", "28.7", "32.2"…
$ MedOwnCostPctInc      <chr> "21.1", "20.7", "21.7", "20.6", "23.2"…
$ MedOwnCostPctIncNoMtg <chr> "14", "12.5", "11.6", "14.5", "12.9", …
$ NumInShelters         <dbl> 11, 0, 16, 0, 2, 327, 0, 21, 125, 43, …
$ NumStreet             <dbl> 0, 0, 0, 0, 0, 4, 0, 0, 15, 4, 0, 49, …
$ PctForeignBorn        <chr> "10.66", "8.3", "5", "2.04", "1.74", "…
$ PctBornSameState      <chr> "53.72", "77.17", "44.77", "88.71", "7…
$ PctSameHouse85        <chr> "65.29", "71.27", "36.6", "56.7", "42.…
$ PctSameCity85         <chr> "78.09", "90.22", "61.26", "90.17", "6…
$ PctSameState85        <chr> "89.14", "96.12", "82.85", "96.24", "8…
$ LemasSwornFT          <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTPerPop       <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTFieldOps     <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasSwFTFieldPerPop  <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasTotalReq         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasTotReqPerPop     <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicReqPerOffic      <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicPerPop           <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ RacialMatchCommPol    <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicWhite         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicBlack         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicHisp          <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicAsian         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PctPolicMinor         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ OfficAssgnDrugUnits   <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ NumKindsDrugsSeiz     <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicAveOTWorked      <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LandArea              <chr> "6.5", "10.6", "10.6", "5.2", "11.5", …
$ PopDens               <chr> "1845.9", "2186.7", "2780.9", "3217.7"…
$ PctUsePubTrans        <chr> "9.63", "3.84", "4.37", "3.31", "0.38"…
$ PolicCars             <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ PolicOperBudg         <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasPctPolicOnPatr   <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasGangUnitDeploy   <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ LemasPctOfficDrugUn   <chr> "0", "0", "0", "0", "0", "0", "0", "0"…
$ PolicBudgPerPop       <chr> "?", "?", "?", "?", "?", "?", "?", "?"…
$ murders               <dbl> 0, 0, 3, 0, 0, 7, 0, 8, 0, 29, 1, 12, …
$ murdPerPop            <chr> "0", "0", "8.3", "0", "0", "4.63", "0"…
$ rapes                 <chr> "0", "1", "6", "10", "?", "77", "4", "…
$ rapesPerPop           <chr> "0", "4.25", "16.6", "57.86", "?", "50…
$ robberies             <chr> "1", "5", "56", "10", "4", "136", "9",…
$ robbbPerPop           <chr> "8.2", "21.26", "154.95", "57.86", "32…
$ assaults              <chr> "4", "24", "14", "33", "14", "449", "5…
$ assaultPerPop         <chr> "32.81", "102.05", "38.74", "190.93", …
$ burglaries            <chr> "14", "57", "274", "225", "91", "2094"…
$ burglPerPop           <chr> "114.85", "242.37", "758.14", "1301.78…
$ larcenies             <chr> "138", "376", "1797", "716", "1060", "…
$ larcPerPop            <chr> "1132.08", "1598.78", "4972.19", "4142…
$ autoTheft             <dbl> 16, 26, 136, 47, 91, 454, 144, 125, 20…
$ autoTheftPerPop       <chr> "131.26", "110.55", "376.3", "271.93",…
$ arsons                <chr> "2", "1", "22", "?", "5", "134", "17",…
$ arsonsPerPop          <chr> "16.41", "4.25", "60.87", "?", "40.05"…
$ ViolentCrimesPerPop   <chr> "41.02", "127.56", "218.59", "306.64",…
$ nonViolPerPop         <chr> "1394.59", "1955.95", "6167.51", "?", …
library(skimr)
crime |> skim()
Table 1: Data summary
Name crime
Number of rows 2215
Number of columns 147
_______________________
Column type frequency:
character 116
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
communityname 0 1 7 28 0 2018 0
state 0 1 2 2 0 48 0
countyCode 0 1 1 3 0 115 0
communityCode 0 1 1 5 0 960 0
householdsize 0 1 1 4 0 198 0
racepctblack 0 1 1 5 0 1172 0
racePctWhite 0 1 2 5 0 1609 0
racePctAsian 0 1 1 5 0 667 0
racePctHisp 0 1 1 5 0 1026 0
agePct12t21 0 1 2 5 0 950 0
agePct12t29 0 1 2 5 0 1184 0
agePct16t24 0 1 1 5 0 947 0
agePct65up 0 1 1 5 0 1221 0
pctUrban 0 1 1 5 0 293 0
pctWWage 0 1 2 5 0 1536 0
pctWFarmSelf 0 1 1 4 0 290 0
pctWInvInc 0 1 2 5 0 1774 0
pctWSocSec 0 1 2 5 0 1548 0
pctWPubAsst 0 1 1 5 0 1125 0
pctWRetire 0 1 2 5 0 1258 0
PctPopUnderPov 0 1 1 5 0 1462 0
PctLess9thGrade 0 1 1 5 0 1275 0
PctNotHSGrad 0 1 1 5 0 1690 0
PctBSorMore 0 1 2 5 0 1660 0
PctUnemployed 0 1 1 5 0 868 0
PctEmploy 0 1 2 5 0 1544 0
PctEmplManu 0 1 1 5 0 1543 0
PctEmplProfServ 0 1 2 5 0 1364 0
PctOccupManu 0 1 1 5 0 1429 0
PctOccupMgmtProf 0 1 1 5 0 1529 0
MalePctDivorce 0 1 1 5 0 948 0
MalePctNevMarr 0 1 2 5 0 1411 0
FemalePctDiv 0 1 1 5 0 1052 0
TotalPctDiv 0 1 1 5 0 994 0
PersPerFam 0 1 1 4 0 161 0
PctFam2Par 0 1 2 5 0 1664 0
PctKids2Par 0 1 2 5 0 1687 0
PctYoungKids2Par 0 1 2 5 0 1679 0
PctTeen2Par 0 1 2 5 0 1635 0
PctWorkMomYoungKids 0 1 2 5 0 1546 0
PctWorkMom 0 1 2 5 0 1433 0
PctKidsBornNeverMar 0 1 1 5 0 768 0
PctImmigRecent 0 1 1 5 0 1481 0
PctImmigRec5 0 1 1 5 0 1658 0
PctImmigRec8 0 1 1 5 0 1755 0
PctImmigRec10 0 1 1 5 0 1808 0
PctRecentImmig 0 1 1 5 0 448 0
PctRecImmig5 0 1 1 5 0 568 0
PctRecImmig8 0 1 1 5 0 670 0
PctRecImmig10 0 1 1 5 0 738 0
PctSpeakEnglOnly 0 1 2 5 0 1416 0
PctNotSpeakEnglWell 0 1 1 5 0 625 0
PctLargHouseFam 0 1 1 5 0 785 0
PctLargHouseOccup 0 1 1 5 0 680 0
PersPerOccupHous 0 1 1 4 0 185 0
PersPerOwnOccHous 0 1 1 4 0 185 0
PersPerRentOccHous 0 1 1 4 0 214 0
PctPersOwnOccup 0 1 2 5 0 1802 0
PctPersDenseHous 0 1 1 5 0 850 0
PctHousLess3BR 0 1 2 5 0 1755 0
PctHousOccup 0 1 2 5 0 1019 0
PctHousOwnOcc 0 1 2 5 0 1801 0
PctVacantBoarded 0 1 1 5 0 715 0
PctVacMore6Mos 0 1 2 5 0 1780 0
PctHousNoPhone 0 1 1 5 0 972 0
PctWOFullPlumb 0 1 1 4 0 181 0
MedRentPctHousInc 0 1 2 4 0 159 0
MedOwnCostPctInc 0 1 2 4 0 154 0
MedOwnCostPctIncNoMtg 0 1 2 4 0 87 0
PctForeignBorn 0 1 1 5 0 1180 0
PctBornSameState 0 1 2 5 0 1833 0
PctSameHouse85 0 1 2 5 0 1690 0
PctSameCity85 0 1 2 5 0 1623 0
PctSameState85 0 1 2 5 0 1337 0
LemasSwornFT 0 1 1 5 0 221 0
LemasSwFTPerPop 0 1 1 7 0 344 0
LemasSwFTFieldOps 0 1 1 5 0 216 0
LemasSwFTFieldPerPop 0 1 1 7 0 343 0
LemasTotalReq 0 1 1 7 0 320 0
LemasTotReqPerPop 0 1 1 10 0 344 0
PolicReqPerOffic 0 1 1 6 0 337 0
PolicPerPop 0 1 1 6 0 323 0
RacialMatchCommPol 0 1 1 5 0 318 0
PctPolicWhite 0 1 1 5 0 318 0
PctPolicBlack 0 1 1 5 0 289 0
PctPolicHisp 0 1 1 5 0 224 0
PctPolicAsian 0 1 1 5 0 114 0
PctPolicMinor 0 1 1 5 0 314 0
OfficAssgnDrugUnits 0 1 1 4 0 68 0
NumKindsDrugsSeiz 0 1 1 2 0 16 0
PolicAveOTWorked 0 1 1 5 0 305 0
LandArea 0 1 1 6 0 572 0
PopDens 0 1 2 7 0 2162 0
PctUsePubTrans 0 1 1 5 0 746 0
PolicCars 0 1 1 4 0 194 0
PolicOperBudg 0 1 1 10 0 342 0
LemasPctPolicOnPatr 0 1 1 5 0 308 0
LemasGangUnitDeploy 0 1 1 2 0 4 0
LemasPctOfficDrugUn 0 1 1 5 0 266 0
PolicBudgPerPop 0 1 1 10 0 344 0
murdPerPop 0 1 1 5 0 906 0
rapes 0 1 1 4 0 172 0
rapesPerPop 0 1 1 6 0 1622 0
robberies 0 1 1 5 0 418 0
robbbPerPop 0 1 1 7 0 2061 0
assaults 0 1 1 5 0 574 0
assaultPerPop 0 1 1 7 0 2150 0
burglaries 0 1 1 5 0 910 0
burglPerPop 0 1 1 8 0 2200 0
larcenies 0 1 1 6 0 1455 0
larcPerPop 0 1 1 8 0 2212 0
autoTheftPerPop 0 1 1 7 0 2173 0
arsons 0 1 1 4 0 179 0
arsonsPerPop 0 1 1 6 0 1578 0
ViolentCrimesPerPop 0 1 1 7 0 1974 0
nonViolPerPop 0 1 1 8 0 2114 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fold 0 1 5.49 2.87 1 3.00 5 8.0 10 ▇▇▇▇▇
population 0 1 53117.98 204620.25 10005 14366.00 22792 43024.0 7322564 ▇▁▁▁▁
numbUrban 0 1 47734.72 205606.69 0 0.00 18041 41918.0 7322564 ▇▁▁▁▁
medIncome 0 1 33984.70 13424.68 8866 23817.00 31441 41480.5 123625 ▇▆▁▁▁
medFamInc 0 1 39857.06 14251.21 10447 29538.00 36678 46999.0 139008 ▇▇▁▁▁
perCapInc 0 1 15603.52 6281.56 5237 11602.50 14101 17795.0 63302 ▇▃▁▁▁
whitePerCap 0 1 16567.70 6346.84 5472 12610.50 15073 18609.5 68850 ▇▂▁▁▁
blackPerCap 0 1 11541.75 9232.10 0 6742.50 9777 14526.0 212120 ▇▁▁▁▁
indianPerCap 0 1 12229.19 14853.84 0 6345.00 9895 14757.5 480000 ▇▁▁▁▁
AsianPerCap 0 1 14227.99 9881.27 0 8285.50 12250 17327.5 106165 ▇▁▁▁▁
OtherPerCap 1 1 9442.77 7926.47 0 5528.25 8186 11525.5 137000 ▇▁▁▁▁
HispPerCap 0 1 11019.00 5884.06 0 7274.00 9721 13418.0 54648 ▇▅▁▁▁
NumUnderPov 0 1 7590.85 39361.46 78 912.50 2142 4988.0 1384994 ▇▁▁▁▁
NumKidsBornNeverMar 0 1 2141.42 14692.58 0 147.00 352 1031.5 527557 ▇▁▁▁▁
NumImmig 0 1 6277.27 55419.65 20 400.00 1024 3302.0 2082931 ▇▁▁▁▁
MedNumBR 0 1 2.64 0.51 1 2.00 3 3.0 4 ▁▅▁▇▁
HousVacant 0 1 1748.37 6503.87 36 304.50 558 1228.0 172768 ▇▁▁▁▁
MedYrHousBuilt 0 1 1962.62 11.17 1939 1956.00 1964 1971.0 1987 ▃▆▇▇▂
OwnOccLowQuart 0 1 88695.80 66670.78 14999 41500.00 65500 121500.0 500001 ▇▂▁▁▁
OwnOccMedVal 0 1 113097.52 81906.36 19500 56200.00 82800 150600.0 500001 ▇▃▁▁▁
OwnOccHiQuart 0 1 145318.26 99030.91 28200 74300.00 106700 188000.0 500001 ▇▃▂▁▁
OwnOccQrange 0 1 56622.46 39106.50 0 32200.00 43400 65450.0 331000 ▇▂▁▁▁
RentLowQ 0 1 329.97 144.14 99 213.50 307 421.0 1001 ▇▇▃▁▁
RentMedian 0 1 428.54 170.71 120 289.50 397 544.0 1001 ▆▇▅▂▁
RentHighQ 0 1 527.25 199.29 182 366.00 486 659.5 1001 ▅▇▆▃▂
RentQrange 0 1 197.29 85.21 0 139.00 171 232.5 803 ▇▇▂▁▁
MedRent 0 1 501.47 169.27 192 364.00 467 615.0 1001 ▅▇▅▂▁
NumInShelters 0 1 66.95 564.25 0 0.00 0 22.0 23383 ▇▁▁▁▁
NumStreet 0 1 17.82 245.45 0 0.00 0 1.0 10447 ▇▁▁▁▁
murders 0 1 7.76 58.17 0 0.00 1 3.0 1946 ▇▁▁▁▁
autoTheft 3 1 516.69 3258.16 1 30.00 75 232.5 112464 ▇▁▁▁▁

After this first look at the dataset, the first phase of the SEMMA methodology is cleaning the data. In this section we check for coding problems. It can already be seen that 116 columns are classified as qualitative, even though the dataset contains only four truly qualitative variables, all of which refer to geography (communityname, state, countyCode, communityCode). This misclassification is most likely caused by the “?” entries in the dataset, which should be treated as nulls. We will therefore replace them and then convert the columns to the proper format. Furthermore, several columns could serve as target variables, such as autoTheft, rapesPerPop, nonViolPerPop, etc. As described in the introduction, this research focuses on only two of them, namely ViolentCrimesPerPop and murdPerPop, from which the new target variables will be created. The remaining candidate targets can therefore be excluded.

Let's extract the qualitative (character) columns and replace “?” with NA.

if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

# Extract the character columns and recode "?" as NA
qual_var <- crime %>% select_if(is.character)
qual_var <- qual_var %>% mutate_all(~ na_if(., "?"))
qual_var
# A tibble: 2,215 × 116
   communityname          state countyCode communityCode householdsize
   <chr>                  <chr> <chr>      <chr>         <chr>        
 1 BerkeleyHeightstownsh… NJ    39         5320          3.1          
 2 Marpletownship         PA    45         47616         2.82         
 3 Tigardcity             OR    <NA>       <NA>          2.43         
 4 Gloversvillecity       NY    35         29443         2.4          
 5 Bemidjicity            MN    7          5068          2.76         
 6 Springfieldcity        MO    <NA>       <NA>          2.45         
 7 Norwoodtown            MA    21         50250         2.6          
 8 Andersoncity           IN    <NA>       <NA>          2.45         
 9 Fargocity              ND    17         25700         2.46         
10 Wacocity               TX    <NA>       <NA>          2.62         
# ℹ 2,205 more rows
# ℹ 111 more variables: racepctblack <chr>, racePctWhite <chr>,
#   racePctAsian <chr>, racePctHisp <chr>, agePct12t21 <chr>,
#   agePct12t29 <chr>, agePct16t24 <chr>, agePct65up <chr>,
#   pctUrban <chr>, pctWWage <chr>, pctWFarmSelf <chr>,
#   pctWInvInc <chr>, pctWSocSec <chr>, pctWPubAsst <chr>,
#   pctWRetire <chr>, PctPopUnderPov <chr>, PctLess9thGrade <chr>, …

Let's exclude communityname, state, countyCode and communityCode so that the remaining columns can be converted to numeric. The converted columns are then recombined with the columns that were already numeric, so that “crime” again contains all variables.

# Convert everything except the four identifier columns to numeric
qual_var <- qual_var %>% mutate_at(vars(-communityname, -state, -countyCode, -communityCode), as.numeric)

# Recombine with the columns that were already numeric
# (note: cbind() changes the original column order)
quant_var <- crime %>% select_if(is.numeric)

crime <- cbind(qual_var, quant_var)
#crime

Now that all columns are numeric, let's exclude the target variables we won't use, as well as the four qualitative variables that serve as community identifiers, and save the result in a new dataset called “crime2”, leaving “crime” untouched.

crime2 <- crime %>% select(-c(countyCode, communityCode, communityname, state, fold,
                              arsonsPerPop, arsons, nonViolPerPop, autoTheftPerPop, autoTheft, larcPerPop, larcenies, burglPerPop, burglaries, assaultPerPop, assaults, robbbPerPop, robberies, rapesPerPop, rapes, murders ))
#crime2

Now that the missing values are coded correctly, it is useful to investigate whether they make up a large share of the data. We will therefore check which variables contain many NAs.

# Count missing values per column ("ausentes" = missing values)
ausentes <- 
  apply(crime2, 2, function(x) sum(is.na(x)))

ausentes_tb <- 
  tibble(Variable = names(crime2), Ausentes = ausentes) |> 
  filter(Ausentes > 0)
ausentes_tb
# A tibble: 24 × 2
   Variable             Ausentes
   <chr>                   <int>
 1 LemasSwornFT             1872
 2 LemasSwFTPerPop          1872
 3 LemasSwFTFieldOps        1872
 4 LemasSwFTFieldPerPop     1872
 5 LemasTotalReq            1872
 6 LemasTotReqPerPop        1872
 7 PolicReqPerOffic         1872
 8 PolicPerPop              1872
 9 RacialMatchCommPol       1872
10 PctPolicWhite            1872
# ℹ 14 more rows
Variable NAs
LemasSwornFT 1872
LemasSwFTPerPop 1872
LemasSwFTFieldOps 1872
LemasSwFTFieldPerPop 1872
LemasTotalReq 1872
LemasTotReqPerPop 1872
PolicReqPerOffic 1872
PolicPerPop 1872
RacialMatchCommPol 1872
PctPolicWhite 1872
PctPolicBlack 1872
PctPolicHisp 1872
PctPolicAsian 1872
PctPolicMinor 1872
OfficAssgnDrugUnits 1872
NumKindsDrugsSeiz 1872
PolicAveOTWorked 1872
PolicCars 1872
PolicOperBudg 1872
LemasPctPolicOnPatr 1872
LemasGangUnitDeploy 1872
PolicBudgPerPop 1872
ViolentCrimesPerPop 221
OtherPerCap 1

There are in total 24 columns with NAs, and in most of them the missing values make up a significant portion: 1,872 out of 2,215 observations, i.e. roughly 85% of the values. Almost all of these variables refer to police-related factors. It is also important to note that the violence target variable has several NAs, which is not ideal, although they account for only about 10% of its values rather than 85%. The general practice in the SEMMA methodology would be to impute the median, mean or mode, but with such large shares of missing data the imputed values would probably not be representative. However, testing for collinearity is not possible with so many NAs. A good compromise is therefore to perform the imputation for now and, when checking for collinearity, to preferentially exclude the variable with the higher number of NAs whenever collinearity is high. We will impute the median, since the variables are quantitative and continuous.

The same method will be used for the violence target variable. One could argue for simply excluding the records with NAs, but this would reduce the sample size by 10%, and considering that the number of observations is barely above what is needed to train a robust model, each data point is of utmost importance and could contribute significant and unique information. In addition, there could be a pattern behind the missing values, meaning that certain values of the independent variables are more likely to be missing; deleting those records could eliminate specific patterns in the data. Since we are working with real-life data obtained from a reliable authority (the FBI), it can be assumed that data for such an important variable is not missing at random. So, in order to maintain the diversity and the original sample size, the missing values in ViolentCrimesPerPop will not be deleted but imputed with the median, which is robust to skewness and outliers.
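A minimal sketch of this median imputation, assuming it is applied to every numeric column of crime2 (the helper name impute_median is illustrative, not from the source):

```r
# Replace each NA in a numeric column with that column's median;
# non-numeric columns are left untouched.
impute_median <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })
  df
}

# e.g. crime2 <- impute_median(crime2)
```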

Apart from missing values, outliers also have to be reviewed.
Once the dataset has been corrected for outliers and NAs, we will create the binary target variables.
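The binary targets described in the introduction can then be obtained with a simple median split; a sketch (the helper and the new column names are illustrative, not from the source):

```r
# Median split: 1 if above the overall median, 0 otherwise, as a factor.
median_split <- function(x) {
  as.factor(as.integer(x > median(x, na.rm = TRUE)))
}

# e.g. crime2$murd_bin <- median_split(crime2$murdPerPop)
#      crime2$viol_bin <- median_split(crime2$ViolentCrimesPerPop)
```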

For the following sections the variables will be divided into the categories identified previously in this research (Economic State/Wealth, Demographic Characteristics, Police-Related Factors and Family Structure), to provide a clearer overview. At the end of this section, two datasets containing the variables identified as most important will be created, one for each target variable (Violence and Murder).

Outliers

Economic State/Wealth:

Variables:

medIncome, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, OwnOccHiQuart, OwnOccQrange, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg
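The per-variable boxplots below are written out one by one; the same plots could be generated more compactly with a small helper (a sketch, assuming ggplot2 and the multiplot() used elsewhere in this document; Rmisc::multiplot() accepts a plotlist argument):

```r
library(ggplot2)

# One boxplot per variable name, mirroring the plots below.
box_by <- function(df, var) {
  ggplot(df, aes(x = murdPerPop, y = .data[[var]])) +
    geom_boxplot() +
    theme_minimal() +
    coord_flip()
}

econ_vars <- c("medIncome", "medFamInc", "perCapInc")  # extend as needed
# plots <- lapply(econ_vars, box_by, df = crime2)
# multiplot(plotlist = plots, cols = 2)
```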

box1 <- 
  ggplot(crime2, aes(murdPerPop, medIncome)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box2 <- 
  ggplot(crime2, aes(murdPerPop, medFamInc)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box3 <- 
  ggplot(crime2, aes(murdPerPop, perCapInc)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box4 <- 
  ggplot(crime2, aes(murdPerPop, whitePerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box5 <- 
  ggplot(crime2, aes(murdPerPop, blackPerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box6 <- 
  ggplot(crime2, aes(murdPerPop, indianPerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box7 <- 
  ggplot(crime2, aes(murdPerPop, AsianPerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box8 <- 
  ggplot(crime2, aes(murdPerPop, OtherPerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box9 <- 
  ggplot(crime2, aes(murdPerPop, HispPerCap)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box10 <- 
  ggplot(crime2, aes(murdPerPop, pctWWage)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box11 <- 
  ggplot(crime2, aes(murdPerPop, pctWFarmSelf)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box13 <- 
  ggplot(crime2, aes(murdPerPop, pctWInvInc)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box14 <- 
  ggplot(crime2, aes(murdPerPop, pctWSocSec)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box15 <- 
  ggplot(crime2, aes(murdPerPop, pctWPubAsst)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box16 <- 
  ggplot(crime2, aes(murdPerPop, pctWRetire)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box17 <- 
  ggplot(crime2, aes(murdPerPop, OwnOccLowQuart)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box18 <- 
  ggplot(crime2, aes(murdPerPop, OwnOccMedVal)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box19 <- 
  ggplot(crime2, aes(murdPerPop, OwnOccHiQuart)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box20 <- 
  ggplot(crime2, aes(murdPerPop, OwnOccQrange)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box21 <- 
  ggplot(crime2, aes(murdPerPop, RentLowQ)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box22 <- 
  ggplot(crime2, aes(murdPerPop, RentMedian)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box23 <- 
  ggplot(crime2, aes(murdPerPop, RentHighQ)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box24 <- 
  ggplot(crime2, aes(murdPerPop, RentQrange)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box25 <- 
  ggplot(crime2, aes(murdPerPop, MedRent)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box26 <- 
  ggplot(crime2, aes(murdPerPop, MedRentPctHousInc)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box27 <- 
  ggplot(crime2, aes(murdPerPop, MedOwnCostPctInc)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box28 <- 
  ggplot(crime2, aes(murdPerPop, MedOwnCostPctIncNoMtg)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2)
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20, cols = 2)
multiplot(box21, box22, box23, box24, box25, box26, box27, box28, cols = 2)

Looking at these box plots, all the variables referring to economic state are asymmetrical, so outliers are detected and, just like the missing values, will be imputed with the median. The only three variables with no outliers, or at most very few, are pctWInvInc, RentHighQ and MedRent.
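As a sketch of this outlier imputation, using the 1.5 × IQR whisker rule that the boxplots themselves are based on (the helper name and the exact cutoff are assumptions; the document does not state the rule explicitly):

```r
# Replace values outside the 1.5 * IQR whiskers with the column median.
cap_outliers_median <- function(x, k = 1.5) {
  qs  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  out <- !is.na(x) & (x < qs[1] - k * iqr | x > qs[2] + k * iqr)
  x[out] <- median(x, na.rm = TRUE)
  x
}

# e.g. crime2$medIncome <- cap_outliers_median(crime2$medIncome)
```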

Demographic Characteristics

Variables:

population, householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, PctImmigRecent, PctImmigRec5, PctImmigRec8, PctImmigRec10, PctRecentImmig, PctRecImmig5, PctRecImmig8, PctRecImmig10, PctSpeakEnglOnly, PctNotSpeakEnglWell, PctForeignBorn, PctBornSameState, PctSameHouse85, PctSameCity85, PctSameState85

box1 <- 
  ggplot(crime2, aes(murdPerPop, population)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box2 <- 
  ggplot(crime2, aes(murdPerPop, householdsize)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box3 <- 
  ggplot(crime2, aes(murdPerPop, racepctblack)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box4 <- 
  ggplot(crime2, aes(murdPerPop, racePctAsian)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box5 <- 
  ggplot(crime2, aes(murdPerPop, racePctHisp)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box6 <- 
  ggplot(crime2, aes(murdPerPop, agePct12t21)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box7 <- 
  ggplot(crime2, aes(murdPerPop, agePct12t29)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box8 <- 
  ggplot(crime2, aes(murdPerPop, agePct16t24)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box9 <- 
  ggplot(crime2, aes(murdPerPop, agePct65up)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box10 <- 
  ggplot(crime2, aes(murdPerPop, numbUrban)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box11 <- 
  ggplot(crime2, aes(murdPerPop, pctUrban)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box13 <- 
  ggplot(crime2, aes(murdPerPop, PctImmigRecent)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box14 <- 
  ggplot(crime2, aes(murdPerPop, PctImmigRec5)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box15 <- 
  ggplot(crime2, aes(murdPerPop, PctImmigRec8)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box16 <- 
  ggplot(crime2, aes(murdPerPop, PctImmigRec10)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box17 <- 
  ggplot(crime2, aes(murdPerPop, PctRecentImmig)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box18 <- 
  ggplot(crime2, aes(murdPerPop, PctRecImmig5)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box19 <- 
  ggplot(crime2, aes(murdPerPop, PctRecImmig8)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box20 <- 
  ggplot(crime2, aes(murdPerPop, PctRecImmig10)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box21 <- 
  ggplot(crime2, aes(murdPerPop, PctSpeakEnglOnly)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box22 <- 
  ggplot(crime2, aes(murdPerPop, PctNotSpeakEnglWell)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box23 <- 
  ggplot(crime2, aes(murdPerPop, PctForeignBorn)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box24 <- 
  ggplot(crime2, aes(murdPerPop, PctBornSameState)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box25 <- 
  ggplot(crime2, aes(murdPerPop, PctSameHouse85)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box26 <- 
  ggplot(crime2, aes(murdPerPop, PctSameCity85)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box27 <- 
  ggplot(crime2, aes(murdPerPop, PctSameState85)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box28 <- 
  ggplot(crime2, aes(murdPerPop, racePctWhite)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2) 
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20,cols = 2) 
multiplot(  box21, box22, box23, box24, box25, box26, box27, box28, cols = 2)

Once again, almost all variables have outliers. pctUrban, rather than showing outliers, concentrates a very large number of values at the extremes of its distribution. Its median sits at the far right, indicating a heavily skewed distribution with many values clustered there. In general, most variables are strongly skewed to the left or right and have many outliers.
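The skew these boxplots suggest can also be quantified. A minimal base-R sketch on simulated right-skewed data (this is not computed in the original analysis; the `skewness` helper is defined here, not taken from a package):

```r
# Sample skewness: positive for right-skewed data, negative for left-skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)
x <- rexp(1000)   # exponential draws are strongly right-skewed
skewness(x)       # clearly positive
```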

box1 <- 
  ggplot(crime2, aes(murdPerPop, LemasSwornFT)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box2 <- 
  ggplot(crime2, aes(murdPerPop, LemasSwFTPerPop)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box3 <- 
  ggplot(crime2, aes(murdPerPop, LemasSwFTFieldOps)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box4 <- 
  ggplot(crime2, aes(murdPerPop, LemasSwFTFieldPerPop)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box5 <- 
  ggplot(crime2, aes(murdPerPop, LemasTotalReq)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box6 <- 
  ggplot(crime2, aes(murdPerPop, LemasTotReqPerPop)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box7 <- 
  ggplot(crime2, aes(murdPerPop, PolicReqPerOffic)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box8 <- 
  ggplot(crime2, aes(murdPerPop, PolicPerPop)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box9 <- 
  ggplot(crime2, aes(murdPerPop, RacialMatchCommPol)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box10 <- 
  ggplot(crime2, aes(murdPerPop, PctPolicWhite)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box11 <- 
  ggplot(crime2, aes(murdPerPop, PctPolicBlack)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box13 <- 
  ggplot(crime2, aes(murdPerPop, PctPolicHisp)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box14 <- 
  ggplot(crime2, aes(murdPerPop, PctPolicAsian)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box15 <- 
  ggplot(crime2, aes(murdPerPop, PctPolicMinor)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box16 <- 
  ggplot(crime2, aes(murdPerPop, OfficAssgnDrugUnits)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box17 <- 
  ggplot(crime2, aes(murdPerPop, NumKindsDrugsSeiz)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box18 <- 
  ggplot(crime2, aes(murdPerPop, PolicAveOTWorked)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box19 <- 
  ggplot(crime2, aes(murdPerPop, LandArea)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box20 <- 
  ggplot(crime2, aes(murdPerPop, PopDens)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box21 <- 
  ggplot(crime2, aes(murdPerPop, PctUsePubTrans)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box22 <- 
  ggplot(crime2, aes(murdPerPop, PolicCars)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box23 <- 
  ggplot(crime2, aes(murdPerPop, PolicOperBudg)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box24 <- 
  ggplot(crime2, aes(murdPerPop, LemasPctPolicOnPatr)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box25 <- 
  ggplot(crime2, aes(murdPerPop, LemasGangUnitDeploy)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box26 <- 
  ggplot(crime2, aes(murdPerPop, LemasPctOfficDrugUn)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box27 <- 
  ggplot(crime2, aes(murdPerPop, PolicBudgPerPop)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

multiplot(box1, box2, box3, box4, box5, box6, box7, box8, box9, box10, cols = 2) 
multiplot(box11, box13, box14, box15, box16, box17, box18, box19, box20,cols = 2) 
multiplot(  box21, box22, box23, box24, box25, box26, box27, cols = 2)

Once again, there are outliers present in almost all of the variables. The only one without outliers is LemasGangUnitDeploy, which nevertheless has the most interesting distribution, showing exceptionally high variability.

Family Structure:

Variables:

MalePctDivorce, MalePctNevMarr, FemalePctDiv, TotalPctDiv, PersPerFam, PctFam2Par, PctKids2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctWorkMom, NumKidsBornNeverMar, PctKidsBornNeverMar

box1 <- 
  ggplot(crime2, aes(murdPerPop, MalePctDivorce)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box2 <- 
  ggplot(crime2, aes(murdPerPop, MalePctNevMarr)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box3 <- 
  ggplot(crime2, aes(murdPerPop, FemalePctDiv)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box4 <- 
  ggplot(crime2, aes(murdPerPop, TotalPctDiv)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box5 <- 
  ggplot(crime2, aes(murdPerPop, PersPerFam)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box6 <- 
  ggplot(crime2, aes(murdPerPop, PctFam2Par)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box7 <- 
  ggplot(crime2, aes(murdPerPop, PctKids2Par)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box8 <- 
  ggplot(crime2, aes(murdPerPop, PctYoungKids2Par)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box9 <- 
  ggplot(crime2, aes(murdPerPop, PctTeen2Par)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box10 <- 
  ggplot(crime2, aes(murdPerPop, PctWorkMomYoungKids)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box11 <- 
  ggplot(crime2, aes(murdPerPop, PctWorkMom)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box13 <- 
  ggplot(crime2, aes(murdPerPop, NumKidsBornNeverMar)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()
box14 <- 
  ggplot(crime2, aes(murdPerPop, PctKidsBornNeverMar)) +
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

multiplot(box1, box2, box3, box4, box5, box6, box7, cols = 2) 
multiplot( box8, box9, box10, box11, box13, box14, cols = 2) 

It can be observed that the only variables that appear evenly distributed are TotalPctDiv and FemalePctDiv, as their medians are centered and their whiskers have similar lengths. They still have outliers, although considerably fewer than the remaining variables.

Now that we have identified NAs and tested for outliers, we conclude that almost all variables contain outliers, so these must be corrected. Outliers will be detected with z-scores, using a threshold of 2.5, and then imputed with the median, as mentioned above. The NAs already present in the variables will be fixed in the same step.
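As a toy illustration of this rule (simulated numbers, not the crime dataset): flag values with |z| > 2.5 and replace them with the median of the remaining values. `outliers::scores(x, type = "z")` computes the same z-scores as the manual formula below.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 1000)   # 1000 is an obvious outlier
z <- (x - mean(x)) / sd(x)                # same as outliers::scores(x, type = "z")
x[abs(z) > 2.5] <- NA                     # flag the outlier as missing
x[is.na(x)] <- median(x, na.rm = TRUE)    # impute the median of the remaining values
x                                         # the 1000 has been replaced by 3
```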

# Statistical mode (base R's mode() returns the storage type, not the most frequent value)
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

crime2 <-
  crime2 |>

  # Detect outliers (|z| > 2.5) and label them as NAs
  dplyr::mutate(across(where(is.numeric), function(x) { ifelse(abs(scores(x, type = "z")) > 2.5 & !is.na(x), NA, x) })) |>

  # Impute NAs: the median for numeric variables, the mode for character variables
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
  mutate(across(where(is.character), ~ ifelse(is.na(.), stat_mode(.), .)))

Target Variable

Let’s focus on the target variables.

We must create the new binary target variables, and for that it is crucial to decide whether to base the split on the mean, the median, or the mode of the values. For this, the distributions of ViolentCrimesPerPop and murdPerPop must be reviewed.

hist(crime2$ViolentCrimesPerPop, main = "Distribution Violence", col = "red")
hist(crime2$murdPerPop, main = "Distribution Murder", col = "red")

We see that both are highly skewed to the right, so splitting at the median is the best choice: it keeps the binary target variable as balanced as possible.
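A quick simulated check (not from the crime data) of why the median beats the mean as a cut-point for right-skewed data:

```r
set.seed(123)
x <- rexp(1000)          # right-skewed, like the crime rates

mean(x > median(x))      # 0.5: a median split yields perfectly balanced classes
mean(x > mean(x))        # well below 0.5: a mean split leaves the positive class in the minority
```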

crime2_v <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop)) 
crime2_m <- crime2 %>% select(-c(ViolentCrimesPerPop, murdPerPop))

# Split each continuous target at its median (base R's median())
median_v <- median(crime2$ViolentCrimesPerPop)
crime2_v$violence <- ifelse(crime2$ViolentCrimesPerPop > median_v, 1, 0)

median_m <- median(crime2$murdPerPop)
crime2_m$murder <- ifelse(crime2$murdPerPop > median_m, 1, 0)

crime2 <- crime2  %>%  select(-c(ViolentCrimesPerPop, murdPerPop)) 
variable <- crime2_v  %>%  select(violence) 
variable2 <- crime2_m  %>%  select(murder) 

crime2 <- cbind(crime2, variable, variable2)



## crime2_v: dataset with all variables plus the violence target variable
## crime2_m: dataset with all variables plus the murder target variable
## crime2:   dataset with all variables plus both target variables

We created the two datasets for the two target variables: crime2_v and crime2_m.

Let's confirm that the class balance of the two target variables is acceptable and, at the same time, identify the minority and majority classes. This also tells us which share serves as the base accuracy and which as the base error rate.
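The idea behind these baselines, sketched on a toy label vector (hypothetical numbers): a model that always predicts the majority class already reaches the majority share as its accuracy, so any useful model must beat that.

```r
y <- c(rep(0, 55), rep(1, 45))             # toy labels: 55 % class 0, 45 % class 1

baseline_acc <- max(table(y)) / length(y)  # always predict the majority class (0)
baseline_err <- 1 - baseline_acc           # share of the minority class (1)

c(accuracy = baseline_acc, error = baseline_err)   # 0.55 and 0.45
```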

Violence:

set.seed(123)
# violence 
table_hd <- table(crime2_v$violence)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 45.01 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 54.99 %
sum(crime2_v$violence == 0)
[1] 1218
sum(crime2_v$violence == 1)
[1] 997
Violence   Accuracy Base   Error Rate Base
0          54.99 %         -
1          -               45.01 %

Murder:

# murder 
table_hd <- table(crime2_m$murder)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 34.36 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 65.64 %
sum(crime2_m$murder == 0)
[1] 1454
sum(crime2_m$murder == 1)
[1] 761
Murder   Accuracy Base   Error Rate Base
0        65.64 %         -
1        -               34.36 %

The class balance for murder is worse, so the minority class will be oversampled.
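What random oversampling does, sketched in base R on toy data (the analysis itself uses ROSE::ovun.sample): minority-class rows are re-sampled with replacement until both classes have the same size.

```r
set.seed(1234)
df <- data.frame(x = rnorm(10), murder = c(rep(0, 7), rep(1, 3)))   # 7 vs 3

minority <- which(df$murder == 1)
majority <- which(df$murder == 0)
extra    <- sample(minority, length(majority) - length(minority), replace = TRUE)

balanced <- df[c(majority, minority, extra), ]   # duplicated minority rows added
table(balanced$murder)                           # 7 and 7
```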

library(ROSE)

# Perform random oversampling
set.seed(1234)
crime2_m_balanced <- ovun.sample(factor(murder) ~ ., data = crime2_m, method = "over", N = 2 * table(crime2_m$murder)[1])$data

table(crime2_m_balanced$murder)

   0    1 
1454 1454 
# murder 
table_hd <- table(crime2_m_balanced$murder)
base_error_rate <- round((table_hd[2]/sum(table_hd))*100, 2)
cat("Reference base Error rate:", base_error_rate, "%\n")
Reference base Error rate: 50 %
base_accuracy <- round((table_hd[1]/sum(table_hd))*100, 2)
cat("Base accuracy:", base_accuracy, "%\n")
Base accuracy: 50 %
sum(crime2_m_balanced$murder == 0)
[1] 1454
sum(crime2_m_balanced$murder == 1)
[1] 1454
crime2_m <- crime2_m_balanced

It can be seen that both target variables are now acceptably balanced.

Collinearity & Representation of the target variable

Now that we have fixed outliers and NAs and have our target variables ready, we will check for collinearity to see which variables can be excluded, and also examine how the target variable is represented within the independent variables. As in the outliers section, we will divide this section into the different variable groups.

For collinearity, an absolute correlation of 0.5 or above is considered high, and at least one of the variables in each such pair must be excluded.
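A base-R sketch (toy data, not part of the original analysis) of how such pairs can be listed programmatically from a correlation matrix:

```r
set.seed(1)
d <- data.frame(a = rnorm(100))
d$b <- d$a + rnorm(100, sd = 0.5)     # strongly correlated with a
d$c <- rnorm(100)                     # roughly independent

cm <- abs(cor(d))
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair only once
which(cm >= 0.5, arr.ind = TRUE)      # flags the (b, a) pair
```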

Economic State/Wealth

Collinearity

cor_matrix <- 
  crime2 |> 
  dplyr::select(medIncome, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, OwnOccHiQuart, OwnOccQrange, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg) |> 
  cor() |> 
  round(2)

corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")

Let's filter for absolute values of 0.5 or above.

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

We will delete: medIncome, medFamInc, whitePerCap, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, OwnOccLowQuart, OwnOccMedVal, RentLowQ, RentMedian, RentHighQ, RentQrange, MedRent, HispPerCap, OwnOccHiQuart, OwnOccQrange

Let's check that no correlation of 0.5 or above remains:

cor_matrix <- 
  crime2 |> 
  dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg) |> 
  cor() |> 
  round(2)

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Reducing the dataset to the remaining variables (keeping a backup copy first):

crime2_m_b <- crime2_m
crime2_m <- crime2_m |> select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, murder) 

Representation

library(ggtext)


a1 <- crime2_m |> ggplot(aes(x = murder, y = perCapInc, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in perCapInc") 

a2 <- crime2_m |> ggplot(aes(x = murder, y = blackPerCap, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in blackPerCap") 

a3 <- crime2_m |> ggplot(aes(x = murder, y = indianPerCap, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in indianPerCap") 


a4 <- crime2_m |> ggplot(aes(x = murder, y = AsianPerCap, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in AsianPerCap") 


a5 <- crime2_m |> ggplot(aes(x = murder, y = OtherPerCap, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in OtherPerCap") 

a6 <- crime2_m |> ggplot(aes(x = murder, y = pctWWage, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in pctWWage") 

a7 <- crime2_m |> ggplot(aes(x = murder, y = MedRentPctHousInc, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in MedRentPctHousInc") 

a8 <- crime2_m |> ggplot(aes(x = murder, y = MedOwnCostPctInc, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in MedOwnCostPctInc") 

a9 <- crime2_m |> ggplot(aes(x = murder, y = MedOwnCostPctIncNoMtg, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in MedOwnCostPctIncNoMtg") 


multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9,  cols = 2)

We can see that for each variable the two murder classes differ only slightly in distribution, since the boxplots for 0 and 1 show a high degree of similarity. However, for perCapInc the variance is higher for class 0 and more outliers are present; the same holds for AsianPerCap and pctWWage.

Demographic Characteristics:

Collinearity

cor_matrix <- 
  crime2 |> 
  dplyr::select(population, householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, PctImmigRecent, PctImmigRec5, PctImmigRec8, PctImmigRec10, PctRecentImmig, PctRecImmig5, PctRecImmig8, PctRecImmig10, PctSpeakEnglOnly, PctNotSpeakEnglWell, PctForeignBorn, PctBornSameState, PctSameHouse85, PctSameCity85, PctSameState85) |> 
  cor() |> 
  round(2)

corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Variables to delete: agePct12t21, racePctAsian, numbUrban, agePct16t24, PctRecImmig10, PctRecImmig5, PctImmigRec5, PctImmigRec10, PctNotSpeakEnglWell, PctSameCity85, PctSameState85, racePctHisp, PctImmigRec8, PctRecentImmig, PctRecImmig8

In general, we delete variables that report absolute counts rather than percentages (e.g. numbUrban is dropped in favour of pctUrban). Among the highly correlated immigration variables (PctImmigRec5/8/10 and PctRecImmig5/8/10), all are dropped and only PctImmigRecent is kept, since the others carry largely the same information.

cor_matrix <- 
  crime2 |> 
  dplyr::select(population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent,  PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85) |> 
  cor() |> 
  round(2)

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Adding the chosen variables to dataset:

new_var <- crime2_m_b |> select(population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent,  PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85) 


crime2_m <- cbind(new_var, crime2_m)

Representation

a1 <- crime2_m |> ggplot(aes(x = murder, y = population, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in population") 

a2 <- crime2_m |> ggplot(aes(x = murder, y = racepctblack, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in racepctblack") 

a3 <- crime2_m |> ggplot(aes(x = murder, y = racePctWhite, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in racePctWhite") 


a4 <- crime2_m |> ggplot(aes(x = murder, y = agePct12t29, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in agePct12t29") 


a5 <- crime2_m |> ggplot(aes(x = murder, y = PctImmigRecent, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctImmigRecent") 

a6 <- crime2_m |> ggplot(aes(x = murder, y = PctSpeakEnglOnly, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctSpeakEnglOnly") 

a7 <- crime2_m |> ggplot(aes(x = murder, y = PctForeignBorn, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctForeignBorn") 

a8 <- crime2_m |> ggplot(aes(x = murder, y = PctBornSameState, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctBornSameState") 

a9 <- crime2_m |> ggplot(aes(x = murder, y = PctSameHouse85, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctSameHouse85") 


multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9,  cols = 2)

Looking at the boxplots, in most cases there is a difference in median, variance, and outliers between the two classes of the target variable. For racepctblack, for example, the higher median for class 1 indicates that communities with a higher share of African Americans tend to have above-average murder rates. The opposite holds for racePctWhite where, despite the values being more clustered, the median percentage of Caucasians in communities with below-average murder rates is close to 100.

Law Enforcement:

Collinearity

As seen before, essentially all NA values in the original dataset were present in the variables addressing police-related factors.

cor_matrix <- 
  crime2 |> 
  dplyr::select(LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasSwFTFieldPerPop, LemasTotalReq, LemasTotReqPerPop, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicBlack, PctPolicHisp, PctPolicAsian, PctPolicMinor, OfficAssgnDrugUnits, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PopDens, PctUsePubTrans, PolicCars, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, PolicBudgPerPop) |> 
  cor() |> 
  round(2)

corrplot::corrplot(cor_matrix, method = "number", tl.cex = 0.5, number.cex = 0.4, type = "lower")

# Keep only the entries of the correlation matrix with an absolute value of at least 0.5
filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA

corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

As mentioned above, when having the choice, we will try to delete the variables that had a lot of NAs.

To delete: LemasTotReqPerPop, LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasSwFTFieldPerPop, LemasTotalReq, PctPolicBlack, PctPolicHisp, OfficAssgnDrugUnits, PopDens, PolicBudgPerPop, PolicCars

We are keeping: PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn

It was decided to delete PopDens (population density) and keep PctUsePubTrans (percentage using public transport), because similar density information appears in other categories, while PctUsePubTrans is an interesting variable that has not yet been considered.

Unfortunately, we still have to keep variables that had high amounts of NAs such as PolicReqPerOffic, PolicPerPop or PctPolicWhite.

cor_matrix <- 
  crime2 |> 
  dplyr::select(PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn) |> 
  cor() |> 
  round(2)


filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Adding the chosen variables to the dataset:

new_var <- crime2_m_b |> select(PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn) 


crime2_m <- cbind(new_var, crime2_m)

Representation

murder:

a1 <- crime2_m |> ggplot(aes(x = murder, y = PolicReqPerOffic, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PolicReqPerOffic") 

a2 <- crime2_m |> ggplot(aes(x = murder, y = PolicPerPop, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PolicPerPop") 

a3 <- crime2_m |> ggplot(aes(x = murder, y = RacialMatchCommPol, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in RacialMatchCommPol") 


a4 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicWhite, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctPolicWhite") 


a5 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicAsian, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctPolicAsian") 

a6 <- crime2_m |> ggplot(aes(x = murder, y = PctPolicMinor, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctPolicMinor") 

a7 <- crime2_m |> ggplot(aes(x = murder, y = NumKindsDrugsSeiz, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in NumKindsDrugsSeiz") 

a8 <- crime2_m |> ggplot(aes(x = murder, y = PolicAveOTWorked, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PolicAveOTWorked") 

a9 <- crime2_m |> ggplot(aes(x = murder, y = LandArea, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in LandArea") 

a10 <- crime2_m |> ggplot(aes(x = murder, y = PctUsePubTrans, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctUsePubTrans") 

a11 <- crime2_m |> ggplot(aes(x = murder, y = PolicOperBudg, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PolicOperBudg") 

a12 <- crime2_m |> ggplot(aes(x = murder, y = LemasPctPolicOnPatr, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in LemasPctPolicOnPatr") 

a13 <- crime2_m |> ggplot(aes(x = murder, y = LemasGangUnitDeploy, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in LemasGangUnitDeploy")

a14 <- crime2_m |> ggplot(aes(x = murder, y = LemasPctOfficDrugUn, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in LemasPctOfficDrugUn")

multiplot(a1, a2, a3, a4 , cols = 2)
multiplot(a5, a6, a7, a8, a9,  cols = 2)
multiplot(a10, a11, a12, a13, a14,  cols = 2)

Unfortunately, most boxplots turn out to be highly clustered and very similar between the two classes of the target variable. This suggests these variables have little predictive power, which can likely be attributed to the high number of NAs found in most of them. It underlines once more the importance of having eliminated most of them during the collinearity check.

Family Structure:

Collinearity

cor_matrix <- 
  crime2 |> 
  dplyr::select(MalePctDivorce, MalePctNevMarr, FemalePctDiv, TotalPctDiv, PersPerFam, PctFam2Par, PctKids2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctWorkMom, NumKidsBornNeverMar, PctKidsBornNeverMar) |> 
  cor() |> 
  round(2)

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

To delete: NumKidsBornNeverMar, MalePctDivorce, FemalePctDiv, PctFam2Par, PctYoungKids2Par, PctTeen2Par, PctWorkMomYoungKids, PctKidsBornNeverMar

It was decided to keep PctKids2Par (percentage of kids in two-parent households) rather than PctYoungKids2Par (percentage of kids <= 4 in two-parent households) or PctTeen2Par (percentage of kids aged 12-17 in two-parent households), since it covers both. The same reasoning applies to including PctWorkMom (percentage of working mothers) over PctWorkMomYoungKids (percentage of working mothers with young kids).

cor_matrix <- 
  crime2 |> 
  dplyr::select(MalePctNevMarr, TotalPctDiv, PersPerFam,  PctKids2Par, PctWorkMom) |> 
  cor() |> 
  round(2)

filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Adding new variables to dataset:

new_var <- crime2_m_b |> select(MalePctNevMarr, TotalPctDiv, PersPerFam,  PctKids2Par, PctWorkMom) 


crime2_m <- cbind(new_var, crime2_m)

Representation

murder:

a1 <- crime2_m |> ggplot(aes(x = murder, y = MalePctNevMarr, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in MalePctNevMarr") 

a2 <- crime2_m |> ggplot(aes(x = murder, y = TotalPctDiv, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in TotalPctDiv") 

a3 <- crime2_m |> ggplot(aes(x = murder, y = PersPerFam, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PersPerFam") 


a4 <- crime2_m |> ggplot(aes(x = murder, y = PctKids2Par, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctKids2Par") 


a5 <- crime2_m |> ggplot(aes(x = murder, y = PctWorkMom, fill = factor(murder))) +
  geom_boxplot(alpha=0.3) +
  coord_flip() + labs(y="",  title="Murder in PctWorkMom") 


multiplot(a1, a2, a3, a4 ,a5, cols = 2)

As opposed to the police-related variables, here the two classes of the target variable separate visibly within the independent variables. TotalPctDiv (total percentage of divorced people), for example, tends to reach its higher values when murder = 1, suggesting that many above-average dangerous communities have a higher divorce rate.

Lastly, we will draw a correlation heat map over all remaining variables to check whether information overlaps across the different groups.

cor_matrix <- 
  crime2 |> 
  dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, population, racepctblack, racePctWhite, agePct12t29, PctImmigRecent,  PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, MalePctNevMarr, TotalPctDiv, PersPerFam,  PctKids2Par, PctWorkMom) |> 
  cor() |> 
  round(2)


filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Let's exclude PctKids2Par, population, MalePctNevMarr, and MedOwnCostPctInc from the final dataset.

crime2_m <- crime2_m |> select(-c(PctKids2Par, population,  MalePctNevMarr, MedOwnCostPctInc) )
cor_matrix <- 
  crime2 |> 
  dplyr::select(perCapInc, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, pctWWage, MedRentPctHousInc, MedOwnCostPctIncNoMtg, racepctblack, racePctWhite, agePct12t29, PctImmigRecent,  PctSpeakEnglOnly, PctForeignBorn, PctBornSameState, PctSameHouse85, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicAsian, PctPolicMinor, NumKindsDrugsSeiz, PolicAveOTWorked, LandArea, PctUsePubTrans, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy, LemasPctOfficDrugUn, TotalPctDiv, PersPerFam, PctWorkMom) |> 
  cor() |> 
  round(2)


filtered_matrix <- abs(cor_matrix)
filtered_matrix[filtered_matrix < 0.5] <- NA


corrplot(filtered_matrix, method = "number", type = "lower",tl.cex = 0.55, number.cex = 0.5, na.label=" ")

Final stage of data cleaning

Standardizing

Now we transform the numerical variables, which sit on very different scales (all except the target variable), to a common scale so they can be analyzed and compared more easily.
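A minimal sketch of the z-score standardization applied below, on a toy vector (hypothetical numbers): after centering by the mean and scaling by the standard deviation, every variable has mean 0 and standard deviation 1.

```r
v <- c(10, 20, 30, 40)

v_std <- (v - mean(v)) / sd(v)   # equivalent to scale(v, center = mean(v), scale = sd(v))

round(mean(v_std), 10)   # 0
round(sd(v_std), 10)     # 1
```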

listconti= c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",  "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)
vardep<-"murder"

means <-apply(crime2_m[,listconti],2,mean,na.rm=TRUE)
sds<-sapply(crime2_m[,listconti],sd,na.rm=TRUE)

crime_m_2<-scale(crime2_m[,listconti], center = means, scale = sds)

crime_m_final <- data.frame(crime_m_2)


new_var <- crime2_m_b |> select(murder) 


crime_m_final <- cbind(new_var, crime_m_final)
archivo_m <-crime_m_final 

dput(names(crime_m_final))
c("murder", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", 
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", 
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", 
"PersPerFam", "PctWorkMom")
vardep= c("murder") 
variable_indepe = c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg",  "racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", "PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", "PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", "PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", "LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", "PersPerFam", "PctWorkMom"
)

y<-archivo_m[,vardep]
x<-archivo_m[,variable_indepe]

Before we continue, we set up parallel computing: we create a cluster using all the cores of the machine except one and register it as the parallel backend.

GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1) 
registerDoParallel(cluster) 

We continue with the selection of variables, using different methods:

AIC

 full<-glm(murder~.,data=archivo_m,family = binomial(link="logit"))
null<-glm(murder~1,data=archivo_m,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full),
direction="both",family = binomial(link="logit"),trace=FALSE)
vec<-(names(selec1[[1]]))
length(vec)
[1] 18
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn", 
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop", 
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap", 
"PctWorkMom")
AIC_list <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn", 
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop", 
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap", 
"PctWorkMom")

BIC

We calculate the BIC penalty with the formula k = log(n): k = log(2908) ≈ 7.975.
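The penalty value can be checked directly (n = 2908, the number of observations after cleaning and oversampling); stepAIC's default penalty is k = 2, and passing k = log(n) makes the stepwise search BIC-based.

```r
# BIC-style penalty for stepAIC: k = log(n) instead of the default k = 2.
n <- 2908
k <- log(n)
round(k, 3)   # 7.975
```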

full<-glm(murder~.,data=archivo_m,family = binomial(link="logit"))
null<-glm(murder~1,data=archivo_m,family = binomial(link="logit"))
selec1<-stepAIC(null,scope=list(upper=full), 
direction="both",family = binomial(link="logit"),trace=FALSE,k=7.975)
vec<-(names(selec1[[1]]))
length(vec)
[1] 9
dput(vec)
c("(Intercept)", "racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn"
)
BIC_list <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn"
)

Wrapper Boruta

out.boruta <- Boruta(murder~., data = crime_m_final)
print(out.boruta)
Boruta performed 16 iterations in 1.130421 mins.
 33 attributes confirmed important: agePct12t29, AsianPerCap,
blackPerCap, indianPerCap, LandArea and 28 more;
 No attributes deemed unimportant.
summary(out.boruta)
              Length Class    Mode     
finalDecision  33    factor   numeric  
ImpHistory    576    -none-   numeric  
pValue          1    -none-   numeric  
maxRuns         1    -none-   numeric  
light           1    -none-   logical  
mcAdj           1    -none-   logical  
timeTaken       1    difftime numeric  
roughfixed      1    -none-   logical  
call            3    -none-   call     
impSource       1    -none-   character
sal<-data.frame(out.boruta$finalDecision)

sal2<-sal[which(sal$out.boruta.finalDecision=="Confirmed"),,drop=FALSE]

dput(row.names(sal2))
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", 
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", 
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", 
"PersPerFam", "PctWorkMom")
Boruta_list <- 
  c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", 
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", 
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", 
"PersPerFam", "PctWorkMom")

MXM

mmpc1 <- MMPC(target = vardep, dataset=crime_m_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
mmpc1@selectedVars
[1]  3  8 10 11 23 28 31 32
a<-dput(names(archivo_m[,c(mmpc1@selectedVars)]))
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite", 
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
length(a)
[1] 8
MXM_list <- c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite", 
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)

Wrapper SES

SES1 <- SES(vardep, crime_m_final, max_k = 3, hash = TRUE,
test = "testIndLogistic")
SES1@selectedVars
[1]  3  8 10 11 23 28 31 32
dput(names(crime_m_final[,c(SES1@selectedVars)]))
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite", 
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
a<-dput(names(crime_m_final[,c(SES1@selectedVars)]))
c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite", 
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)
length(a)
[1] 8
SES_list <- c("blackPerCap", "MedRentPctHousInc", "racepctblack", "racePctWhite", 
"PctPolicMinor", "PolicOperBudg", "LemasPctOfficDrugUn", "TotalPctDiv"
)

SBF from caret

filtro<-sbf(x,y,sbfControl =
              sbfControl(functions = rfSBF,
                         method = "cv", verbose = FALSE))

a<-dput(filtro$optVariables)
c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "LandArea", "PctUsePubTrans", 
"PolicOperBudg", "LemasPctPolicOnPatr", "LemasPctOfficDrugUn", 
"TotalPctDiv", "PersPerFam", "PctWorkMom")
length(a)
[1] 30
SBF_list <- c("perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "LandArea", "PctUsePubTrans", 
"PolicOperBudg", "LemasPctPolicOnPatr", "LemasPctOfficDrugUn", 
"TotalPctDiv", "PersPerFam", "PctWorkMom")

RFE

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

results <- rfe(x, y, sizes=c(1:8), rfeControl=control)

selecrfe<-results$optVariables
length(selecrfe)
[1] 33
dput(selecrfe)
c("racePctWhite", "TotalPctDiv", "racepctblack", "pctWWage", 
"PctImmigRecent", "PctWorkMom", "PctBornSameState", "LandArea", 
"AsianPerCap", "blackPerCap", "perCapInc", "indianPerCap", "PersPerFam", 
"PctSameHouse85", "OtherPerCap", "PctSpeakEnglOnly", "MedRentPctHousInc", 
"MedOwnCostPctIncNoMtg", "agePct12t29", "PctUsePubTrans", "PctForeignBorn", 
"PctPolicMinor", "PolicAveOTWorked", "PolicPerPop", "LemasPctOfficDrugUn", 
"LemasPctPolicOnPatr", "PctPolicWhite", "PolicOperBudg", "PolicReqPerOffic", 
"NumKindsDrugsSeiz", "RacialMatchCommPol", "PctPolicAsian", "LemasGangUnitDeploy"
)
RFE_list <-c("racePctWhite", "TotalPctDiv", "racepctblack", "pctWWage", 
"PctImmigRecent", "PctBornSameState", "LandArea", "AsianPerCap", 
"PctWorkMom", "perCapInc", "blackPerCap", "indianPerCap", "OtherPerCap", 
"PctSpeakEnglOnly", "agePct12t29", "PctSameHouse85", "PersPerFam", 
"MedRentPctHousInc", "MedOwnCostPctIncNoMtg", "PctUsePubTrans", 
"PctForeignBorn", "PctPolicMinor", "PolicAveOTWorked", "LemasPctPolicOnPatr", 
"PolicPerPop", "PolicOperBudg", "LemasPctOfficDrugUn", "PctPolicWhite", 
"PolicReqPerOffic", "RacialMatchCommPol", "NumKindsDrugsSeiz", 
"PctPolicAsian", "LemasGangUnitDeploy")

Let’s compare the selection methods, each with its own set of chosen variables.

Boxplot comparison under logistic regression

crime_m_final2 <- crime_m_final
crime_m_final2$murder<-ifelse(crime_m_final2$murder==1,"Yes","No")
crime_m_final2$murder<-factor(crime_m_final2$murder)
crime_m_final2  <- na.omit(crime_m_final2 )
crime_m_final2  <-as_tibble(crime_m_final2 )
crime_m_final2  <- as.data.frame(crime_m_final2 )

data2<- crime_m_final2 

medias1<-cruzadalogistica(data=data2, 
                           vardep="murder",listconti=AIC_list,
                           listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias1 <- as.data.frame(medias1[[1]])  
medias1$modelo="AIC"


medias2<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=BIC_list,
                    listclass=c(""),grupos=4,sinicio=1234,repe=25)
medias2 <- as.data.frame(medias2[[1]])  
medias2$modelo="BIC"

medias3<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=
                       Boruta_list,
                    listclass=c(""),grupos=4,sinicio=1234,repe=25)

medias3 <- as.data.frame(medias3[[1]])  
medias3$modelo="Boruta"

medias4<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=RFE_list,
                    listclass=c(""),grupos=4,sinicio=1234,repe=25)

medias4 <- as.data.frame(medias4[[1]])  
medias4$modelo="RFE"

medias5<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=
                       SBF_list,
                      listclass=c(""),grupos=4,sinicio=1234,repe=25)

medias5 <- as.data.frame(medias5[[1]])  
medias5$modelo="SBF"

medias6<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=
                       MXM_list,
                      listclass=c(""),grupos=4,sinicio=1234,repe=25)

medias6 <- as.data.frame(medias6[[1]])  
medias6$modelo="MXM"


medias7<-cruzadalogistica(data=data2,
                    vardep="murder",listconti=
                       SES_list,
                      listclass=c(""),grupos=4,sinicio=1234,repe=25)

medias7 <- as.data.frame(medias7[[1]])  
medias7$modelo="SES"

union1<-rbind(medias1,medias2, medias3,
              medias4,medias5, 
              medias6, medias7)

par(cex.axis=0.7)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
par(cex.axis=0.7)
boxplot(data=union1,col="blue",auc~modelo,main="AUC")

It can be seen that the best model is AIC, so we will choose its variables.

We choose the variables selected by AIC: c(“racePctWhite”, “TotalPctDiv”, “pctWWage”, “blackPerCap”, “PolicOperBudg”, “PctPolicMinor”, “LemasPctOfficDrugUn”, “PctForeignBorn”, “MedRentPctHousInc”, “PolicAveOTWorked”, “PersPerFam”, “PolicPerPop”, “racepctblack”, “PctBornSameState”, “PctPolicWhite”, “indianPerCap”, “PctWorkMom”)

Model Num. Variables Variables
AIC 17 c(“racePctWhite”, “TotalPctDiv”, “pctWWage”, “blackPerCap”, “PolicOperBudg”, “PctPolicMinor”, “LemasPctOfficDrugUn”, “PctForeignBorn”, “MedRentPctHousInc”,“PolicAveOTWorked”, “PersPerFam”, “PolicPerPop”, “racepctblack”, “PctBornSameState”, “PctPolicWhite”, “indianPerCap”, “PctWorkMom”)
Wrapper Boruta 33 c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “MedOwnCostPctIncNoMtg”, “racepctblack”, “racePctWhite”, “agePct12t29”, “PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “PolicPerPop”, “RacialMatchCommPol”, “PctPolicWhite”, “PctPolicAsian”, “PctPolicMinor”, “NumKindsDrugsSeiz”, “PolicAveOTWorked”, “LandArea”, “PctUsePubTrans”, “PolicOperBudg”, “LemasPctPolicOnPatr”, “LemasGangUnitDeploy”, “LemasPctOfficDrugUn”, “TotalPctDiv”, “PersPerFam”, “PctWorkMom”)
MXM 8 c(“blackPerCap”, “MedRentPctHousInc”, “racepctblack”, “racePctWhite”, “PctPolicMinor”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “TotalPctDiv”)
Wrapper SES 8 c(“blackPerCap”, “MedRentPctHousInc”, “racepctblack”, “racePctWhite”, “PctPolicMinor”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “TotalPctDiv”)
SBF 30 c(“perCapInc”, “blackPerCap”, “indianPerCap”, “AsianPerCap”, “OtherPerCap”, “pctWWage”, “MedRentPctHousInc”, “MedOwnCostPctIncNoMtg”, “racepctblack”, “racePctWhite”,“agePct12t29”, “PctImmigRecent”, “PctSpeakEnglOnly”, “PctForeignBorn”, “PctBornSameState”, “PctSameHouse85”, “PolicReqPerOffic”, “PolicPerPop”, “RacialMatchCommPol”, “PctPolicWhite”, “PctPolicAsian”, “PctPolicMinor”, “LandArea”, “PctUsePubTrans”, “PolicOperBudg”, “LemasPctPolicOnPatr”, “LemasPctOfficDrugUn”, “TotalPctDiv”,“PersPerFam”, “PctWorkMom”)
RFE 33 c(“racePctWhite”, “TotalPctDiv”, “racepctblack”, “pctWWage”, “PctImmigRecent”, “PctBornSameState”, “LandArea”, “AsianPerCap”, “PctWorkMom”, “perCapInc”,“blackPerCap”, “indianPerCap”, “OtherPerCap”, “PctSpeakEnglOnly”, “agePct12t29”, “PctSameHouse85”, “PersPerFam”, “MedRentPctHousInc”,“MedOwnCostPctIncNoMtg”, “PctUsePubTrans”,“PctForeignBorn”, “PctPolicMinor”, “PolicAveOTWorked”, “LemasPctPolicOnPatr”, “PolicPerPop”, “PolicOperBudg”, “LemasPctOfficDrugUn”, “PctPolicWhite”,“PolicReqPerOffic”, “RacialMatchCommPol”, “NumKindsDrugsSeiz”, “PctPolicAsian”, “LemasGangUnitDeploy”)
BIC 8 c(“racePctWhite”, “TotalPctDiv”, “pctWWage”, “blackPerCap”, “PolicOperBudg”, “PctPolicMinor”, “LemasPctOfficDrugUn”, “PctForeignBorn”)
variables<-c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn", 
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop", 
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap", 
"PctWorkMom")

data2<-data2[,c(variables,vardep)]

Models

Neural Network

Now that we have decided which variables to use, let's look for the best parameters of the network, first without and then with early stopping. First it is important to estimate the number of hidden nodes with the parameter-count formula: number of parameters = h(k+1) + h + 1, where h is the number of hidden nodes and k the number of inputs. We have 2908 observations and 17 variables, and we want about 30 observations per parameter, which allows roughly 2908/30 ≈ 97 parameters.

With 5 hidden nodes we will need at least 2880 observations:

5(17+1) + 5 + 1 = 96 -> 96 × 30 = 2880

With 7 nodes there would be 134 parameters, requiring 4020 observations.
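This arithmetic can be reproduced directly; h and k below are just the hidden-node and input counts from the formula, nothing beyond the numbers already stated.

```r
# Parameters of a single-hidden-layer network: h(k+1) + h + 1,
# with k = 17 inputs; at 30 observations per parameter:
n_params <- function(h, k = 17) h * (k + 1) + h + 1
n_params(5)        # 96  -> 96 * 30 = 2880 observations
n_params(7)        # 134 -> 134 * 30 = 4020 observations
```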

Hopefully we will find an adequate number of nodes, avoiding over-fitting (too many nodes) and under-fitting (too few nodes). In the next section we tune the network with repeated cross-validation using avNNet.

control<-trainControl(method = "repeatedcv",
                      number=4,savePredictions = "all"
                      ,classProbs = TRUE, repeats=4,
                      summaryFunction=twoClassSummary
                      ) 

set.seed(123)
nnetgrid <-  expand.grid(size=c(5, 6,8, 10),decay=c(0.1, 0.01, 0.001),bag=F)

completo<-data.frame()
listaiter<-c(1, 10, 20, 50, 70)


for (iter in listaiter)
{
  rednnet<- train(murder~.,
                  data=data2,
                  method="avNNet",linout = FALSE,maxit=iter,
                  trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
  rednnet$results$itera<-iter
  completo<-rbind(completo,rednnet$results)
  
  
}

completo<-completo[order(completo$ROC),]

ggplot(completo, aes(x=factor(itera), y=ROC, 
                     color=factor(decay),pch=factor(size))) +
  geom_point(position=position_dodge(width=0.5),size=3)

We want a ROC as high as possible with as few iterations as possible. Between 50 and 70 iterations the curve flattens, so let's concentrate on that range. We will also increase the size to see whether we can get better results.

control<-trainControl(method = "repeatedcv",
                      number=4,savePredictions = "all"
                      ,classProbs = TRUE, repeats=4,
                      summaryFunction=twoClassSummary
                      ) 

set.seed(123)
nnetgrid <-  expand.grid(size=c(8, 10, 15, 20),decay=c(0.1, 0.01, 0.001),bag=F)

completo<-data.frame()
listaiter<-c( 60, 70, 80)


for (iter in listaiter)
{
  rednnet<- train(murder~.,
                  data=data2,
                  method="avNNet",linout = FALSE,maxit=iter,
                  trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
  rednnet$results$itera<-iter
  completo<-rbind(completo,rednnet$results)
  
  
}

completo<-completo[order(completo$ROC),]

ggplot(completo, aes(x=factor(itera), y=ROC, 
                     color=factor(decay),pch=factor(size))) +
  geom_point(position=position_dodge(width=0.5),size=3)

We can still see that performance is increasing, so we choose a bigger size and more iterations.

control<-trainControl(method = "repeatedcv",
                      number=4,savePredictions = "all"
                      ,classProbs = TRUE, repeats=4,
                      summaryFunction=twoClassSummary
                      ) 

set.seed(13)
nnetgrid <-  expand.grid(size=c(15, 20, 25, 30),decay=c(0.1, 0.01, 0.001),bag=F)

completo<-data.frame()
listaiter<-c( 80, 90, 100)


for (iter in listaiter)
{
  rednnet<- train(murder~.,
                  data=data2,
                  method="avNNet",linout = FALSE,maxit=iter,
                  trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F, metric='ROC')
  rednnet$results$itera<-iter
  completo<-rbind(completo,rednnet$results)
  
  
}

completo<-completo[order(completo$ROC),]

ggplot(completo, aes(x=factor(itera), y=ROC, 
                     color=factor(decay),pch=factor(size))) +
  geom_point(position=position_dodge(width=0.5),size=3)

The best number of iterations is 101, the best decay (the weight-decay regularization parameter, not a learning rate) is 0.01, and the best number of nodes is 30. We leave it tuned like that and apply early stopping so that it does not overfit. But before comparing it to the previous models, we check whether we get better performance with the mentioned parameters and a maxit of 100.

set.seed(12)
nnetgrid <- expand.grid(size=c(30),decay=c(0.01),bag=F)
completo<-data.frame()
listaiter<-c(101)
for (iter in listaiter)
{
rednnet<- train(murder~.,
data=data2,
method="avNNet",linout = FALSE,maxit=100,
trControl=control,repeats=5,tuneGrid=nnetgrid,trace=F)
rednnet$results$itera<-iter
completo<-rbind(completo,rednnet$results)
}
completo<-completo[order(completo$ROC),]
ggplot(completo, aes(x=factor(itera), y=ROC,
color=factor(decay),pch=factor(size))) +
geom_point(position=position_dodge(width=0.5),size=3)

The ROC is not higher with maxit at 100, so we will continue with the network without capping maxit.

Comparison

data2<-crime_m_final2[,c(AIC_list, "murder")]

medias8<-cruzadaavnnetbin(data=data2, vardep="murder",
listconti=AIC_list, listclass=c(""),
grupos=4, sinicio=1234, repe=25,
repeticiones=5, itera=101, size=c(30),
decay=c(0.01))
  size decay   bag  Accuracy     Kappa AccuracySD    KappaSD
1   30  0.01 FALSE 0.8616215 0.7232425 0.01392548 0.02785255
medias8 <- as.data.frame(medias8[[1]]) 
medias8$modelo="Network"
union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8)

par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
par(cex.axis=0.7)
boxplot(data=union1,col="blue",auc~modelo,main="AUC")

Thanks to the tuning, the neural network is now by far the best model.

Random forest & bagging

We will continue with the Bagging algorithm and find out which sampsize and which ntree are optimal by tuning. For this we perform repeated cross-validation with caret, fixing mtry at the number of independent variables obtained from the AIC model (considering every variable at each split is what makes the random forest equivalent to bagging). In addition to Accuracy, we will also compare the means of Kappa. Kappa is important because it measures agreement after discounting what would be expected by chance, so it helps us assess whether the model's hits reflect genuine skill rather than random agreement, avoiding incorrect interpretations or conclusions. Therefore, it is best to have a Kappa above 0.6.
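As an illustration with made-up counts (not results from this analysis), Cohen's kappa can be computed from a confusion matrix as (po − pe) / (1 − pe), where po is the observed agreement and pe the agreement expected by chance.

```r
# Sketch: Cohen's kappa from a 2x2 confusion matrix (hypothetical counts).
cohen_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (po - pe) / (1 - pe)
}
tab <- matrix(c(620, 107, 95, 632), nrow = 2)   # rows: predicted, cols: actual
round(cohen_kappa(tab), 3)   # 0.722, comfortably above the 0.6 threshold
```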

Sampsize

It is important to note that sampsize is the parameter that lets us control class balance. As seen above, the target variable murder is perfectly balanced thanks to oversampling; nevertheless, we tune sampsize to further improve the model. To determine an appropriate value, we perform repeated cross-validation with values in the range of 10 to 200, starting small because a sampsize that is too high can favour overfitting.
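For a two-class factor response, randomForest interprets a length-2 sampsize vector as the number of cases to draw from each class (stratified by the response). On toy data (not the crime dataset), the equivalent balanced draw looks like this:

```r
# Sketch (toy data): a balanced per-class draw, as sampsize = c(200, 200)
# would request from randomForest for a two-class factor.
set.seed(1)
y <- factor(rep(c("No", "Yes"), times = c(900, 300)))
idx <- c(sample(which(y == "No"), 200),
         sample(which(y == "Yes"), 200))
table(y[idx])   # 200 of each class, regardless of the original imbalance
```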

set.seed(23)
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 10, 50, 100, 200 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
10  0.7637363 0.7812930 0.7866482 0.7870673 0.7949863 0.8076923    0
50  0.7744154 0.7958068 0.7991743 0.8010312 0.8077717 0.8214286    0
100 0.7689133 0.8002063 0.8027495 0.8036451 0.8088843 0.8241758    0
200 0.7867950 0.8069516 0.8136176 0.8133429 0.8195895 0.8337912    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
10  0.5274725 0.5625760 0.5733081 0.5741383 0.5899725 0.6153846    0
50  0.5488402 0.5916545 0.5983486 0.6020640 0.6155451 0.6428571    0
100 0.5378328 0.6004251 0.6054815 0.6072933 0.6177686 0.6483516    0
200 0.5736006 0.6139031 0.6272198 0.6266895 0.6391790 0.6675824    0

We can see that the best sampsize is 200, since Accuracy and Kappa are the highest: Kappa's mean is above 0.6 and Accuracy just surpasses 0.8. Both might still increase with an even bigger sampsize, so we tune again with values above 200.

set.seed(234)

#message("Numero mínimo en sampsize es ",(nmin*0.75))
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c( 200, 250, 300, 350, 400)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 200, 250, 300, 350, 400 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
200 0.7867950 0.8069516 0.8136176 0.8133429 0.8195895 0.8337912    0
250 0.7936726 0.8134258 0.8184316 0.8168490 0.8202754 0.8337912    0
300 0.7977992 0.8174640 0.8239355 0.8215279 0.8267906 0.8324176    0
350 0.8005502 0.8197454 0.8254295 0.8252409 0.8333906 0.8420330    0
400 0.8060523 0.8250261 0.8321871 0.8297127 0.8347107 0.8415978    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
200 0.5736006 0.6139031 0.6272198 0.6266895 0.6391790 0.6675824    0
250 0.5873570 0.6268389 0.6368631 0.6337012 0.6405507 0.6675824    0
300 0.5956083 0.6349276 0.6478711 0.6430595 0.6535813 0.6648352    0
350 0.6011102 0.6394954 0.6508844 0.6504850 0.6667789 0.6840659    0
400 0.6121170 0.6500269 0.6644003 0.6594296 0.6694215 0.6831956    0

The highest Kappa and Accuracy are at sampsize 400, so we tune one more round to see whether they keep increasing with an even higher sampsize.

set.seed(234)

#message("Numero mínimo en sampsize es ",(nmin*0.75))
nmin <- min(table(data2$murder))
mtry <- c(length(AIC_list))
control<-trainControl(method = "repeatedcv", number=4,
allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid', classProbs=TRUE
#, summaryFunction=twoClassSummary
)
rfgrid <- expand.grid(.mtry = mtry)
modellist <- list()
sizes <- c( 400, 450, 500, 550, 600)
for (sampsize in sizes){
set.seed(234)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500, nodesize=20,
metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 400, 450, 500, 550, 600 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
400 0.8060523 0.8250261 0.8321871 0.8297127 0.8347107 0.8415978    0
450 0.8115543 0.8295530 0.8334480 0.8328750 0.8393536 0.8489011    0
500 0.8184319 0.8296118 0.8363152 0.8365870 0.8444061 0.8530220    0
550 0.8143054 0.8371686 0.8430832 0.8415407 0.8464187 0.8571429    0
600 0.8129298 0.8384331 0.8430832 0.8436055 0.8512397 0.8624484    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
400 0.6121170 0.6500269 0.6644003 0.6594296 0.6694215 0.6831956    0
450 0.6231179 0.6591060 0.6669164 0.6657546 0.6787149 0.6978022    0
500 0.6368769 0.6592279 0.6726304 0.6731782 0.6888142 0.7060440    0
550 0.6286227 0.6743470 0.6861590 0.6830857 0.6928375 0.7142857    0
600 0.6258647 0.6768941 0.6861566 0.6872148 0.7024793 0.7249067    0

The best sampsize is 600, with an Accuracy of 0.84 and a Kappa of 0.69.

ntree

We will use sampsize = 600 to determine the number of trees (ntree) to include in the bagging.

ntrees <- c(50, 100, 200, 300)
modellist <- list()
for (ntree in ntrees){
set.seed(12345)
rf<- train(factor(murder)~.,data=data2,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(600, 600))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 50, 100, 200, 300 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
50  0.8090659 0.8335626 0.8375783 0.8377621 0.8433542 0.8569464    0
100 0.8090659 0.8363136 0.8452545 0.8417500 0.8487999 0.8571429    0
200 0.8131868 0.8361384 0.8446744 0.8424377 0.8508953 0.8598901    0
300 0.8173077 0.8376891 0.8432998 0.8422312 0.8512908 0.8557692    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
50  0.6181319 0.6671182 0.6751780 0.6755258 0.6867119 0.7139073    0
100 0.6181319 0.6726551 0.6905148 0.6835015 0.6975997 0.7142857    0
200 0.6263736 0.6722766 0.6893485 0.6848757 0.7017906 0.7197802    0
300 0.6346154 0.6753838 0.6865953 0.6844631 0.7025838 0.7115385    0

We can see the Accuracy is above 0.8, which is good, and Kappa is even close to 0.7. However, there is still an upward trend, so values of ntree above 300 will be tested.

data3 <- data2
ntrees <- c(300, 400, 500)
set.seed(125)

modellist <- list()
for (ntree in ntrees){
set.seed(1225)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=ntree,
nodesize=20, metric='ROC', sampsize=c(600, 600))
key <- toString(ntree)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 300, 400, 500 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
300 0.8250689 0.8367147 0.8445667 0.8443571 0.8521320 0.8640110    0
400 0.8236915 0.8329329 0.8486933 0.8446326 0.8543886 0.8626374    0
500 0.8236915 0.8337341 0.8480055 0.8454571 0.8554216 0.8640110    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
300 0.6501377 0.6734288 0.6891433 0.6887145 0.7042657 0.7280220    0
400 0.6473829 0.6658626 0.6973882 0.6892642 0.7087751 0.7252747    0
500 0.6473829 0.6674683 0.6960168 0.6909142 0.7108423 0.7280220    0

We are going to choose ntree = 500, since it has the highest Accuracy and Kappa.

Random forest

We will start with the selection of mtry, which is the number of variables randomly considered at each split when constructing each tree in the ensemble. Choosing the right value of mtry is important because it affects the randomness and diversity of the random forest's trees. If mtry is too low, each tree may have a limited view of the available features, which can lead to underfitting. On the other hand, if mtry is too high, the trees become more similar to one another, increasing the risk of overfitting. We will use the ntree found with Bagging (500).
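For reference, randomForest's usual default for classification is floor(sqrt(p)); with the 17 selected predictors that would be a much smaller value than most of the grid tried below:

```r
# Default mtry heuristics for p = 17 predictors:
p <- 17
floor(sqrt(p))   # 4 (classification default)
floor(p / 3)     # 5 (regression default)
```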

set.seed(12345)
rfgrid<-expand.grid(mtry=c(2, 3, 5, 7, 9, 10, 20))
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,ntree=500,nodesize=20,replace=TRUE,
importance=T)
rf
Random Forest 

2908 samples
  17 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times) 
Summary of sample sizes: 2180, 2182, 2181, 2181, 2181, 2181, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8426452  0.6852884
   3    0.8519298  0.7038576
   5    0.8567442  0.7134864
   7    0.8588074  0.7176132
   9    0.8599077  0.7198143
  10    0.8597002  0.7193990
  20    0.8588067  0.7176132

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 9.

For mtry = 9, Accuracy is well above 0.8 and Kappa even surpasses 0.7. These are already good measures; with a line chart we can check whether performance continues to increase as mtry grows.

The line flattens out at mtry = 12, so we choose this value, since a smaller mtry is advantageous: mtry is the number of variables considered at each node split, and a low value is beneficial for diversity because different variables are more likely to be considered at each node, ensuring differences among the trees of the forest. This increases the randomness of the ensemble and ultimately its ability to generalize.

Now we will find the appropriate sampsize for the Random Forest. For this we will use the mtry (12) and the ntree (500) obtained above.

set.seed(12345)
rfgrid<-expand.grid(.mtry=12)
modellist <- list()
sizes <- c(10, 50, 100, 200)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 10, 50, 100, 200 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
10  0.7637363 0.7810247 0.7840440 0.7867299 0.7924264 0.8060523    0
50  0.7706044 0.7952593 0.7990364 0.7995892 0.8033747 0.8186813    0
100 0.7867950 0.7988308 0.8068713 0.8066740 0.8154270 0.8236915    0
200 0.7870879 0.8084594 0.8121133 0.8131405 0.8204341 0.8296703    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
10  0.5274725 0.5620152 0.5680872 0.5734565 0.5848464 0.6121317    0
50  0.5412088 0.5905281 0.5980648 0.5991799 0.6067493 0.6373626    0
100 0.5736006 0.5976468 0.6137556 0.6133475 0.6308540 0.6473829    0
200 0.5741758 0.6169213 0.6242340 0.6262811 0.6408503 0.6593407    0

Accuracy is already slightly above 0.8 and Kappa above the threshold of 0.6, but it seems their values can still rise with a higher sampsize.

set.seed(12345)
rfgrid<-expand.grid(.mtry=12)
modellist <- list()
sizes <- c(300, 400, 500, 600)
for (sampsize in sizes){
set.seed(12345)
rf<- train(factor(murder)~.,data=data3,
method="rf",trControl=control,tuneGrid=rfgrid,
linout = FALSE,replace=TRUE, importance=T, ntree=500,
nodesize=20,metric='ROC', sampsize=c(sampsize,sampsize))
key <- toString(sampsize)
modellist[[key]] <- rf
}
#Compare results
results <- resamples(modellist)
summary(results)

Call:
summary.resamples(object = results)

Models: 300, 400, 500, 600 
Number of resamples: 20 

Accuracy 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
300 0.7939560 0.8112756 0.8198074 0.8194013 0.8288064 0.8418157    0
400 0.7994505 0.8219326 0.8280605 0.8280655 0.8340220 0.8502747    0
500 0.8118132 0.8282377 0.8349381 0.8350105 0.8426839 0.8502747    0
600 0.8090659 0.8346514 0.8412375 0.8414771 0.8503615 0.8622590    0

Kappa 
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
300 0.5879121 0.6225347 0.6396022 0.6388017 0.6576269 0.6836511    0
400 0.5989011 0.6438757 0.6561360 0.6561301 0.6680441 0.7005495    0
500 0.6236264 0.6564768 0.6698743 0.6700209 0.6853647 0.7005495    0
600 0.6181319 0.6692929 0.6824729 0.6829540 0.7007183 0.7245179    0

Accuracy and Kappa seem to be the highest at sampsize 600.

Now we will compare bagging and random forest with our previous models.

Comparison

medias9<-cruzadarfbin(data=data2,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=length(AIC_list),ntree=500,replace=TRUE,sampsize=600)
  mtry  Accuracy     Kappa AccuracySD    KappaSD
1   17 0.8212106 0.6424197 0.01410735 0.02821521
medias9 <- as.data.frame(medias9[[1]])
medias9$modelo <- "Bagging"

medias10<-cruzadarfbin(data=data2,
vardep="murder",listconti= AIC_list,
listclass=c(""),grupos=4,sinicio=12345,repe=25,
nodesize=20,mtry=12,ntree=500,replace=TRUE,sampsize=600)
  mtry  Accuracy     Kappa AccuracySD    KappaSD
1   12 0.8206882 0.6413748 0.01313456 0.02627187
medias10 <- as.data.frame(medias10[[1]])
medias10$modelo <- "RF"


union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10)

par(cex.axis=0.8)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
par(cex.axis=0.7)
boxplot(data=union1,col="blue",auc~modelo,main="AUC")

As can be seen, they are among the highest performing models. However, the Neural Network is still considerably better.

Gradient Boosting

We will continue with the Gradient Boosting model. This method combines multiple weak learners (shallow decision trees) into a robust prediction model: trees are added sequentially, each one fitted to the errors of the ensemble so far, and the final prediction is obtained by aggregating the predictions of all the individual trees.

To develop the best possible ensemble, we will take into account Accuracy and Kappa and tune four parameters:

  1. tree depth / interaction.depth: depth level of the trees.
  2. n.trees: number of trees.
  3. shrinkage: controls the contribution of each tree to the ensemble (helps to avoid overfitting).
  4. n.minobsinnode: minimum number of observations in terminal nodes.

GS_T0 <- Sys.time()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

  control<-trainControl(method = "repeatedcv",number=4,repeats=3,
   savePredictions = "all",classProbs=TRUE) 

    set.seed(23)
    
   gbmgrid <-expand.grid(n.minobsinnode=c(5,10,15),
    shrinkage=c(0.1,0.01),n.trees=c(50, 100,200, 300),
    interaction.depth=c(2, 4, 6, 8))
  
  gbm<- train(murder~.,data=data3,
   method="gbm",trControl=control,
   tuneGrid=gbmgrid,distribution="bernoulli",verbose=FALSE)
  
  plot(gbm)
  gbm$bestTune
   n.trees interaction.depth shrinkage n.minobsinnode
92     300                 8       0.1             10
  sorted_results <- gbm$results[order(-gbm$results$Accuracy), ]
print(sorted_results[1, c("Accuracy", "Kappa")])
    Accuracy     Kappa
92 0.8836566 0.7673131

We can see that the best model has a Kappa of 0.77 and an Accuracy of 0.88. The parameters of this model (number 92) are ntrees=300, interaction.depth=8, shrinkage=0.1 and n.minobsinnode = 10.

Let's use these parameters to see whether tuning bag.fraction (the fraction of training observations randomly drawn for each tree) gives us a better-performing model. For this parameter we will try the values 0.8 and 0.5.

Tuning with bag.fraction

set.seed(3)
gbmgrid <-expand.grid(n.minobsinnode=c(10),
shrinkage=c(0.1),n.trees=c(300),
interaction.depth=c(8))
set.seed(3)

gbm<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid,
            method="gbm",trControl=control,distribution="bernoulli",
            bag.fraction=1,verbose=FALSE)
  
gbm1<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.8,verbose=FALSE)

gbm2<- train(factor(murder)~.,data=data3,tuneGrid=gbmgrid, method="gbm",trControl=control,distribution="bernoulli", bag.fraction=0.5,verbose=FALSE)

sorted_results1 <- gbm1$results[order(-gbm1$results$Accuracy), ]

sorted_results <- gbm2$results[order(-gbm2$results$Accuracy), ]
                                  
combined_results <- rbind(sorted_results, sorted_results1)

combined_results
  n.minobsinnode shrinkage n.trees interaction.depth  Accuracy
1             10       0.1     300                 8 0.8747110
2             10       0.1     300                 8 0.8744859
      Kappa AccuracySD    KappaSD
1 0.7494199 0.01612700 0.03225362
2 0.7489688 0.01239853 0.02480232

We see that having a bag.fraction of 0.8 or 0.5 does not bring us much benefit in terms of Accuracy and Kappa, so we will continue without this parameter.

Comparison

medias11<-cruzadagbmbin(data=data3,
                          vardep="murder",listconti=AIC_list,
                          listclass=c(""),grupos=4,sinicio=12345,repe=25,
                        n.minobsinnode=c(10),shrinkage=c(0.1),n.trees=c(300),
                        interaction.depth=c(8))
  n.minobsinnode shrinkage n.trees interaction.depth  Accuracy
1             10       0.1     300                 8 0.8781428
     Kappa AccuracySD    KappaSD
1 0.756285 0.01109668 0.02219527
medias11 <- as.data.frame(medias11[[1]])
medias11$modelo <- "GBM"

union1<-rbind(medias1,medias2, medias3, medias4, medias5, medias6, medias7, medias8, medias9, medias10, medias11)
par(cex.axis=0.8)

boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
par(cex.axis=0.7)
boxplot(data=union1,col="blue",auc~modelo,main="AUC")

GBM proves to be the best model now.

SVM

We continue with the Support Vector Machine. The goal of this model is to find the hyperplane that best separates the data into the different classes, so we are going to use three kernel types: linear, polynomial and RBF (radial basis function), tuning the parameters C, degree and sigma.

Linear

For the linear SVM only a suitable C needs to be found. This parameter controls the trade-off between a wide margin (more bias) and correctly classifying every training point (more risk of overfitting).

set.seed(1234)
  control<-trainControl(method = "repeatedcv",number=4,repeats=5,
   savePredictions = "all",classProbs=TRUE) 
  
  # Apply caret and build the model
  
  SVMgrid <-expand.grid(C=c(0.5,1,2,5,10,30))
  
  SVMlin<- train(murder~.,data=data3,
   method="svmLinear",trControl=control,
   tuneGrid=SVMgrid,replace=replace)
  
  print(SVMlin$results)
     C  Accuracy     Kappa AccuracySD    KappaSD
1  0.5 0.8025430 0.6050879 0.01264452 0.02529202
2  1.0 0.8018555 0.6037118 0.01303748 0.02607800
3  2.0 0.8024745 0.6049512 0.01276444 0.02553110
4  5.0 0.8017176 0.6034362 0.01234997 0.02470681
5 10.0 0.8026119 0.6052261 0.01320621 0.02641420
6 30.0 0.8021310 0.6042636 0.01241081 0.02482416

We can see that all values of C surpass an Accuracy of 0.8, but Kappa is only slightly above 0.6. C=10 seems to have the highest Accuracy and Kappa. Let's choose some values around 10 and visualize the results.

  set.seed(13)
SVMgrid<-expand.grid(C=c(7, 9, 10,12, 15))
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
repeats = 5, savePredictions = "all",search = 'grid')

SVMlin<- train(murder~.,data=data3,
method="svmLinear",trControl=control,
tuneGrid=SVMgrid,verbose=FALSE)

completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Accuracy),]
ggplot(completo, aes(x=factor(C), y=Accuracy)) +
geom_point(position=position_dodge(width=0.5),size=3)
completo<-data.frame()
completo<-rbind(completo,SVMlin$results)
completo<-completo[order(completo$Kappa),]
ggplot(completo, aes(x=factor(C), y=Kappa)) +
geom_point(position=position_dodge(width=0.5),size=3)

We can see that the best C for SVM Linear is C=7.

SVM RBF

Apart from C, we are also going to find the best sigma. In kernlab's parameterization (used by svmRadial), sigma is the inverse of the kernel width: a small sigma gives a wide kernel, in which more distant points are still considered similar, while a large sigma gives a narrow kernel in which similarity decays quickly with distance.
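A minimal numerical sketch of this effect, using kernlab's parameterization of the RBF kernel, k(x, y) = exp(-sigma * ||x - y||^2), which is what caret's svmRadial relies on:

```r
# RBF kernel as parameterized in kernlab (used by caret's svmRadial):
# k(x, y) = exp(-sigma * ||x - y||^2), so sigma is the inverse width
rbf <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

x <- c(0, 0); y <- c(2, 0)   # two points at distance 2

# Small sigma -> wide kernel: distant points still look similar
rbf(x, y, sigma = 0.01)   # close to 1

# Large sigma -> narrow kernel: similarity decays quickly
rbf(x, y, sigma = 1.5)    # close to 0
```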

set.seed(123456)
SVMgrid<-expand.grid(C=c(1, 5,10, 15, 20),
                     sigma=c(0.0001,0.005,0.01,0.05,0.7))

control<-trainControl(method = "cv",
                      number=4,savePredictions = "all")

SVMRBF<- train(murder~.,data=data3,
            method="svmRadial",trControl=control,
            tuneGrid=SVMgrid,verbose=FALSE)


sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]

print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
    Accuracy     Kappa  C sigma
10 0.8720826 0.7441602  5   0.7
15 0.8703656 0.7407260 10   0.7
25 0.8693306 0.7386570 20   0.7

All of the top 3 results have a Kappa higher than 0.7 and an Accuracy close to 0.9. Let's see whether there are better Cs and sigmas close to the parameters of these three models, and let's also visualize it.

set.seed(1234)
SVMgrid<-expand.grid(C=c(3, 5, 7, 10, 12, 15, 20, 25),
                     sigma=c(0.7, 0.9, 1, 1.5, 2, 2.5))

 control<-trainControl(method = "repeatedcv",
                       number=4,savePredictions = "all")

SVMRBF <- train(murder~.,data=data3,
            method="svmRadial",trControl=control,
            tuneGrid=SVMgrid,verbose=FALSE)

 sorted_results <- SVMRBF$results[order(-SVMRBF$results$Accuracy), ]

print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "sigma")])
    Accuracy     Kappa C sigma
10 0.8810088 0.7620187 5   1.5
16 0.8806654 0.7613319 7   1.5
4  0.8806649 0.7613313 3   1.5
completo<-data.frame()

completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Kappa),]

ggplot(completo, aes(x=factor(C), y=Kappa, 
                     color=factor(sigma))) +
  geom_point(position=position_dodge(width=0.5),size=3)
completo<-rbind(completo,SVMRBF$results)
completo<-completo[order(completo$Accuracy),]

ggplot(completo, aes(x=factor(C), y=Accuracy, 
                     color=factor(sigma))) +
  geom_point(position=position_dodge(width=0.5),size=3)

The best values are sigma=1.5 and C=5, for both Accuracy and Kappa.

Polynomial

For this kernel we will tune degree and scale in addition to C. Degree controls the flexibility and shape of the curve separating the different data classes. A degree that is too high can lead to overfitting, as the decision boundary becomes more complex.
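Assuming kernlab's polynomial kernel with its default offset of 1, k(x, y) = (scale * <x, y> + 1)^degree, the effect of degree can be illustrated numerically:

```r
# Polynomial kernel as used by svmPoly (offset fixed at 1):
# k(x, y) = (scale * <x, y> + 1)^degree
poly_kernel <- function(x, y, degree, scale = 1)
  (scale * sum(x * y) + 1)^degree

x <- c(1, 2); y <- c(2, 1)   # inner product <x, y> = 4

# Higher degrees amplify differences in the inner product,
# producing a more flexible (and more overfitting-prone) boundary
poly_kernel(x, y, degree = 2)   # (4 + 1)^2 = 25
poly_kernel(x, y, degree = 4)   # (4 + 1)^4 = 625
```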

set.seed(1235)
 
control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE, 
                      savePredictions = "all",search = 'grid') 

  SVMgrid <-expand.grid(C=c(0.1, 0.3),degree=c(2,3,4),scale=c(1, 2, 3))
  
  SVM_poli<- train(murder~.,data=data3,
   method="svmPoly",trControl=control,
   tuneGrid=SVMgrid,replace=replace)
  
sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]

print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
    Accuracy     Kappa   C degree scale
16 0.8359710 0.6719487 0.3      4     1
7  0.8342516 0.6685107 0.1      4     1
9  0.8253117 0.6506336 0.1      4     3

We can see that model 16 seems to be the best. Let's see if we can find a higher Accuracy and Kappa with similar parameters.

set.seed(1234)

control<-trainControl(method = "repeatedcv",number=4, allowParallel = TRUE,
                      savePredictions = "all",search = 'grid')

  SVMgrid <-expand.grid(C=c( 0.3, 0.5, 0.8, 1),degree=c(4, 6, 8),scale=c(1, 1.5, 2))

  SVM_poli<- train(murder~.,data=data3,
   method="svmPoly",trControl=control,
   tuneGrid=SVMgrid,replace=replace)

sorted_results <- SVM_poli$results[order(-SVM_poli$results$Accuracy), ]

# Print the top 3 rows with highest accuracy and kappa values
print(sorted_results[1:3, c("Accuracy", "Kappa", "C", "degree", "scale")])
    Accuracy     Kappa   C degree scale
4  0.8315006 0.6630115 0.3      6     1
13 0.8315006 0.6630115 0.5      6     1
22 0.8315006 0.6630115 0.8      6     1
set.seed(1235)

completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Accuracy),]

plot1 <- ggplot(completo, aes(x=factor(C), y=Accuracy,
                     color=factor(scale),pch=factor(degree))) +
  geom_point(position=position_dodge(width=0.5),size=3)
plot1
completo<-data.frame()
completo<-rbind(completo,SVM_poli$results)
completo<-completo[order(completo$Kappa),]

plot1 <- ggplot(completo, aes(x=factor(C), y=Kappa,
                     color=factor(scale),pch=factor(degree))) +
  geom_point(position=position_dodge(width=0.5),size=3)
plot1

For SVM Polynomial we choose the parameters C=0.28, degree=6 and scale=1.

Comparison

medias12<-cruzadaSVMbin(data=data3,
                          vardep="murder",listconti= AIC_list,
                          listclass=c(""),grupos=4,sinicio=12345,repe=25,C=7)
  C  Accuracy     Kappa AccuracySD    KappaSD
1 7 0.8007431 0.6014852 0.01245616 0.02491299
medias12 <- as.data.frame(medias12[[1]])
medias12$modelo="SVM_lineal"


medias13<-cruzadaSVMbinRBF(data=data3,
                          vardep="murder",listconti= AIC_list,
                          listclass=c(""),grupos=4,sinicio=12345,repe=25,
                           C=5,sigma=1.5)
  C sigma  Accuracy     Kappa AccuracySD    KappaSD
1 5   1.5 0.8778536 0.7557054 0.01327893 0.02656247
medias13 <- as.data.frame(medias13[[1]])
medias13$modelo="SVM_RBF"

medias14<- cruzadaSVMbinPoly(data=data3,
                          vardep="murder",listconti= AIC_list,
                          listclass=c(""),grupos=4,sinicio=12345,repe=25,
                           C=0.28,degree=6,scale=1)
line search fails 5.837733e-07 -0.0004815689 -3.641992e-05 -9.81476e-12 2.724726e-17 5.016644e-15 -9.923923e-22
     C degree scale  Accuracy     Kappa AccuracySD  KappaSD
1 0.28      6     1 0.5658596 0.1320145 0.05830089 0.116338
medias14 <- as.data.frame(medias14[[1]])
medias14$modelo="SVM_Poli"


union1<-rbind(medias1, medias2, medias3, medias4, medias5, medias6, medias7,
              medias8, medias9, medias10, medias11, medias12, medias13, medias14)
par(cex.axis=0.4)
boxplot(data=union1,col="red",tasa~modelo,main="Error Rate")
par(cex.axis=0.4)
boxplot(data=union1,col="blue",auc~modelo,main="AUC")

We can see that SVM Polynomial is the worst model by far for both AUC and Error Rate. The variance of its values is very high in both cases. In contrast, SVM RBF is among the best ones.

Ensemble

As one of the last steps, we are going to build an ensemble combining our models. We have already tuned a variety of models, so we will use the five best algorithms (Neural Network, Bagging, GBM, SVM RBF and RF).

vardep<-"murder"

listconti<-c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn", 
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop", 
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap", 
"PctWorkMom")

variables2<-listconti


data4<-data2[,c(variables2,vardep)]


 listclass<-c("")

grupos<-4
sinicio<-1234
repe<-50
set.seed(1234)

archivo_m$murder<-ifelse(archivo_m$murder==1,"Yes","No")
archivo_m <- na.omit(archivo_m)
archivo_m <-as_tibble(archivo_m)
archivo_m <- as.data.frame(archivo_m)

dput(names(archivo_m))
c("murder", "perCapInc", "blackPerCap", "indianPerCap", "AsianPerCap", 
"OtherPerCap", "pctWWage", "MedRentPctHousInc", "MedOwnCostPctIncNoMtg", 
"racepctblack", "racePctWhite", "agePct12t29", "PctImmigRecent", 
"PctSpeakEnglOnly", "PctForeignBorn", "PctBornSameState", "PctSameHouse85", 
"PolicReqPerOffic", "PolicPerPop", "RacialMatchCommPol", "PctPolicWhite", 
"PctPolicAsian", "PctPolicMinor", "NumKindsDrugsSeiz", "PolicAveOTWorked", 
"LandArea", "PctUsePubTrans", "PolicOperBudg", "LemasPctPolicOnPatr", 
"LemasGangUnitDeploy", "LemasPctOfficDrugUn", "TotalPctDiv", 
"PersPerFam", "PctWorkMom")
medias1bis<-cruzadaSVMbinRBF(data=data3,
                          vardep=vardep,listconti=listconti,
                          listclass=listclass,grupos=grupos,
                          sinicio=sinicio,repe=repe,
                          C=5,sigma=1.5)
  C sigma  Accuracy     Kappa AccuracySD    KappaSD
1 5   1.5 0.8773853 0.7547708 0.01224981 0.02449797
predi1<-as.data.frame(medias1bis[2])
predi1$svmRadial<-predi1$Yes


medias2bis<-cruzadaavnnetbin(data=data3,
                          vardep=vardep,listconti=listconti,
                          listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,repeticiones=5,itera=101,
                       size=c(30),decay=c(0.01))
  size decay   bag  Accuracy     Kappa AccuracySD   KappaSD
1   30  0.01 FALSE 0.8614919 0.7229834 0.01396391 0.0279286
predi2<-as.data.frame(medias2bis[2])
predi2$network<-predi2$Yes





medias3bis<-cruzadagbmbin(data=data3,
                       vardep=vardep,listconti=listconti,
                       listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,
                        n.minobsinnode=10,shrinkage=0.1,n.trees=300,
                        interaction.depth=8)
  n.minobsinnode shrinkage n.trees interaction.depth  Accuracy
1             10       0.1     300                 8 0.8781907
      Kappa AccuracySD    KappaSD
1 0.7563819 0.01162715 0.02325265
predi3<-as.data.frame(medias3bis[2])
predi3$gbm<-predi3$Yes




medias4bis<-cruzadarfbin(data=data3,
vardep=vardep,listconti= listconti,
                          listclass=listclass,grupos=grupos,sinicio=sinicio,repe=repe,
nodesize=20,mtry=length(AIC_list),ntree=500,replace=TRUE,sampsize=600)
  mtry  Accuracy     Kappa AccuracySD    KappaSD
1   17 0.8212026 0.6424048 0.01320181 0.02640279
predi4<-as.data.frame(medias4bis[2])
predi4$bagging <-predi4$Yes




medias5bis<- cruzadarfbin(data=data3,
                          vardep=vardep,listconti=listconti,
                          listclass=listclass,grupos=grupos,
                          sinicio=sinicio,repe=repe,
                      nodesize=20,mtry=12,ntree=500,replace=TRUE,sampsize=600)
  mtry Accuracy     Kappa AccuracySD    KappaSD
1   12   0.8209 0.6417999 0.01317354 0.02634695
predi5<-as.data.frame(medias5bis[2])
predi5$RF<-predi5$Yes
set.seed(123)
unipredi<-cbind(predi1, predi2, predi3, predi4, predi5)

#we eliminate duplicates
unipredi<- unipredi[, !duplicated(colnames(unipredi))]

#Combinations of 2
unipredi$predi6<-(unipredi$bagging+unipredi$network)/2
unipredi$predi7<-(unipredi$bagging+unipredi$gbm)/2
unipredi$predi8<-(unipredi$bagging+unipredi$RF)/2
unipredi$predi9<-(unipredi$bagging+unipredi$svmRadial)/2
unipredi$predi10<-(unipredi$svmRadial+unipredi$network)/2
unipredi$predi11<-(unipredi$svmRadial+unipredi$gbm)/2
unipredi$predi12<-(unipredi$network+unipredi$gbm)/2
unipredi$predi13<-(unipredi$network+unipredi$RF)/2
unipredi$predi14<-(unipredi$gbm+unipredi$RF)/2
unipredi$predi15<-(unipredi$svmRadial+unipredi$RF)/2

#Combinations of 3

unipredi$predi16<-(unipredi$gbm+unipredi$network+unipredi$svmRadial)/3
unipredi$predi17<-(unipredi$gbm+unipredi$bagging+unipredi$svmRadial)/3
unipredi$predi18<-(unipredi$gbm+unipredi$network+unipredi$bagging)/3
unipredi$predi19<-(unipredi$svmRadial+unipredi$bagging+unipredi$network)/3
unipredi$predi20<-(unipredi$bagging+unipredi$network+unipredi$RF)/3
unipredi$predi21<-(unipredi$svmRadial+unipredi$bagging+unipredi$RF)/3
unipredi$predi22<-(unipredi$RF+unipredi$bagging+unipredi$gbm)/3
unipredi$predi23<-(unipredi$svmRadial+unipredi$RF+unipredi$network)/3
unipredi$predi24<-(unipredi$svmRadial+unipredi$RF+unipredi$gbm)/3
unipredi$predi25<-(unipredi$gbm+unipredi$RF+unipredi$network)/3

#Combinations of 4

unipredi$predi26<-(unipredi$bagging+unipredi$network+unipredi$RF+unipredi$svmRadial)/4
unipredi$predi27<-(unipredi$bagging+unipredi$RF+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi28<-(unipredi$network+unipredi$RF+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi29<-(unipredi$bagging+unipredi$network+unipredi$svmRadial+unipredi$gbm)/4
unipredi$predi30<-(unipredi$bagging+unipredi$network+unipredi$RF+unipredi$gbm)/4

#All 5 

unipredi$predi31<-(unipredi$gbm+unipredi$network+unipredi$svmRadial+unipredi$RF+unipredi$bagging)/5
set.seed(1344)
dput(names(unipredi))
c("pred", "obs", "No", "Yes", "rowIndex", "C", "sigma", "Resample", 
"Fold", "Rep", "svmRadial", "size", "decay", "bag", "network", 
"n.minobsinnode", "shrinkage", "n.trees", "interaction.depth", 
"gbm", "tasa", "auc", "mtry", "bagging", "RF", "predi6", "predi7", 
"predi8", "predi9", "predi10", "predi11", "predi12", "predi13", 
"predi14", "predi15", "predi16", "predi17", "predi18", "predi19", 
"predi20", "predi21", "predi22", "predi23", "predi24", "predi25", 
"predi26", "predi27", "predi28", "predi29", "predi30", "predi31"
)
listado<-c("svmRadial", "network", "RF", "bagging","gbm", "predi6", "predi7", 
           "predi8", "predi9", "predi10", "predi11", "predi12", "predi13", "predi14", "predi15", "predi16", "predi17", "predi18", "predi19", "predi20"
           , "predi21", "predi22", "predi23", "predi24","predi25", "predi26", "predi28", "predi29", "predi30","predi31")

tasafallos<-function(x,y) {
  confu<-confusionMatrix(x,y)
  tasa<-confu[[3]][1]
  return(tasa)
}

auc<-function(x,y) {
  curvaroc<-roc(response=x,predictor=y)
  auc<-curvaroc$auc
  return(auc)
}


repeticiones<-nlevels(factor(unipredi$Rep))
unipredi$Rep<-as.factor(unipredi$Rep)
unipredi$Rep<-as.numeric(unipredi$Rep)


medias0<-data.frame(c())
for (prediccion in listado)
{
  unipredi$proba<-unipredi[,prediccion]
  unipredi[,prediccion]<-ifelse(unipredi[,prediccion]>0.5,"Yes","No")
  for (repe in 1:repeticiones)
  {
    paso <- unipredi[(unipredi$Rep==repe),]
    pre<-factor(paso[,prediccion])
    archi<-paso[,c("proba","obs")]
    archi<-archi[order(archi$proba),]
    obs<-paso[,c("obs")]
    tasa=1-tasafallos(pre,obs)
    t<-as.data.frame(tasa)
    t$modelo<-prediccion
    auc<-suppressMessages(auc(archi$obs,archi$proba))
    t$auc<-auc
    medias0<-rbind(medias0,t)
  }
}

Display best models

We display the sorted boxplots to see the Error Rate and AUC.

set.seed(1233)
medias0$modelo <- with(medias0,
                       reorder(modelo,tasa, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,tasa~modelo,col="red", main='Error Rate')
tablamedias2<-medias0 %>%
  group_by(modelo) %>%
  summarize(auc=mean(auc))     

tablamedias2<-tablamedias2[order(-tablamedias2$auc),]


medias0$modelo <- with(medias0,
                       reorder(modelo,auc, mean))
par(cex.axis=0.7,las=2)
boxplot(data=medias0,auc~modelo,col="blue", main='AUC')

We can see that the top 5 models for AUC also appear in the top 5 for Error Rate. In both cases the top 5 consists entirely of ensembles, showing that combining our models is advantageous.

Top 5 AUC: predi11, predi16, predi15, predi9, predi10. Of these, predi15 has the lowest Error Rate, ranking first there.

Top 5 Error Rate: predi15, predi9, predi11, predi10, predi16. Of these, predi11 has the highest AUC, ranking first there.

Predi11 and Predi15 seem to be the best overall models.

SVM Radial is present in all of the top 5 ensembles for both AUC and Error Rate.

Top 5 AUC:

Rank  Ensemble  Models included
1     Predi11   svmRadial, gbm
2     Predi16   svmRadial, gbm, network
3     Predi15   svmRadial, RF
4     Predi9    svmRadial, bagging
5     Predi10   svmRadial, network

Top 5 Error Rate:

Rank Ensemble Models included
1 Predi15 svmRadial, RF
2 Predi9 svmRadial, bagging
3 Predi11 svmRadial, gbm
4 Predi10 svmRadial, network
5 Predi16 svmRadial, network, gbm

We will run a hypothesis test between Predi11 and Predi15 to decide which one to choose.

Hypothesis

set.seed(232)
listamodelos<-c("predi11","predi15")
datacontraste<-medias0[which(medias0$modelo%in%listamodelos),]
# Error Rates
res <- t.test(datacontraste$tasa ~datacontraste$modelo)
res

    Welch Two Sample t-test

data:  datacontraste$tasa by datacontraste$modelo
t = -3.3559, df = 95.564, p-value = 0.001136
alternative hypothesis: true difference in means between group predi15 and group predi11 is not equal to 0
95 percent confidence interval:
 -0.004575374 -0.001174282
sample estimates:
mean in group predi15 mean in group predi11 
           0.09471802            0.09759285 

The p-value is below 0.05; therefore we can conclude that the difference between the two mean Error Rates is significantly different from zero. Predi15 is significantly better than predi11 in Error Rate.

set.seed(232)
# AUC
res <- t.test(datacontraste$auc ~datacontraste$modelo) 
res

    Welch Two Sample t-test

data:  datacontraste$auc by datacontraste$modelo
t = -2.6644, df = 96.871, p-value = 0.009034
alternative hypothesis: true difference in means between group predi15 and group predi11 is not equal to 0
95 percent confidence interval:
 -0.0028801632 -0.0004210653
sample estimates:
mean in group predi15 mean in group predi11 
            0.9513732             0.9530239 

Here the same is the case: the difference between the two mean AUCs is significantly different from zero. Therefore, as expected, predi11 is significantly better than predi15 in AUC.

However, the p-value for the Error Rate test is further below 0.05, indicating stronger evidence for the difference in Error Rate than for the difference in AUC. Therefore predi15 will be selected as the best model.

Measure     Predi11  Predi15  p-value
AUC          0.9530   0.9514   0.0090
Error Rate   0.0976   0.0947   0.0011

Conclusion

Confusion Matrix

The first step of our conclusion is to construct the confusion matrix based on our ensemble model predi15. The confusion matrix gives an overview of the agreement between the predictions of our model and the actual values. We will use repeated CV with 6 replicates.

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  7978  858
       Yes  746 7866
                                          
               Accuracy : 0.9081          
                 95% CI : (0.9037, 0.9123)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8161          
                                          
 Mcnemar's Test P-Value : 0.005579        
                                          
            Sensitivity : 0.9017          
            Specificity : 0.9145          
         Pos Pred Value : 0.9134          
         Neg Pred Value : 0.9029          
             Prevalence : 0.5000          
         Detection Rate : 0.4508          
   Detection Prevalence : 0.4936          
      Balanced Accuracy : 0.9081          
                                          
       'Positive' Class : Yes             
                                          
Prediction/Reference    No   Yes
No                    7978   858
Yes                    746  7866

Measure      Score
Accuracy    0.9081
Kappa       0.8161
Sensitivity 0.9017
Specificity 0.9145

The metrics prove to be exceptionally good. The True Positive Rate (Sensitivity) is 90% and the True Negative Rate (Specificity) is 91%. Accuracy is also about 90%, which means that 9 out of 10 data points are classified correctly. Kappa is also high, considering that 0.6 is the usual threshold for an acceptable value. This strong predictive capability can be attributed to the oversampling of the minority class performed during the explorative phase: the two classes were perfectly balanced from then on, so there is no significant difference between Sensitivity and Specificity, since the model was not trained more on one of the two classes. Hence, no adjustment of the classification threshold is necessary.
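These metrics can be recomputed by hand from the confusion matrix counts above; a minimal sketch ('Yes' is the positive class):

```r
# Counts taken from the confusion matrix above
TP <- 7866; FN <- 858   # reference Yes
TN <- 7978; FP <- 746   # reference No

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

round(c(accuracy, sensitivity, specificity), 4)
# 0.9081 0.9017 0.9145
```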

Logistic Regression Parameters

set.seed(1234)

data3$murder <- as.factor(data3$murder)
data3$murder <- relevel(data3$murder,ref="No")

var <- c("racePctWhite", "TotalPctDiv", "pctWWage", "blackPerCap", 
"PolicOperBudg", "PctPolicMinor", "LemasPctOfficDrugUn", "PctForeignBorn", 
"MedRentPctHousInc", "PolicAveOTWorked", "PersPerFam", "PolicPerPop", 
"racepctblack", "PctBornSameState", "PctPolicWhite", "indianPerCap", 
"PctWorkMom")

vardep <- "murder"
data4<-data3[,c(var,"murder")]

logistica<-glm(factor(murder)~.,data=data4,family="binomial")

summary(logistica) 

Call:
glm(formula = factor(murder) ~ ., family = "binomial", data = data4)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2170  -0.6984  -0.0274   0.6847   2.5775  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.38252    0.07386   5.179 2.23e-07 ***
racePctWhite        -1.10047    0.17659  -6.232 4.61e-10 ***
TotalPctDiv          0.57164    0.06273   9.113  < 2e-16 ***
pctWWage            -0.30159    0.06704  -4.498 6.85e-06 ***
blackPerCap         -0.17502    0.07262  -2.410 0.015946 *  
PolicOperBudg        0.93382    0.31275   2.986 0.002828 ** 
PctPolicMinor        0.76118    0.21739   3.501 0.000463 ***
LemasPctOfficDrugUn  0.25866    0.06891   3.754 0.000174 ***
PctForeignBorn      -0.24932    0.09881  -2.523 0.011630 *  
MedRentPctHousInc    0.15239    0.05775   2.639 0.008318 ** 
PolicAveOTWorked    -0.16298    0.06112  -2.667 0.007658 ** 
PersPerFam           0.21978    0.08602   2.555 0.010616 *  
PolicPerPop         -0.17377    0.07021  -2.475 0.013331 *  
racepctblack         0.32598    0.16874   1.932 0.053371 .  
PctBornSameState    -0.09319    0.05908  -1.577 0.114703    
PctPolicWhite        0.26132    0.16028   1.630 0.103021    
indianPerCap        -0.10508    0.07364  -1.427 0.153609    
PctWorkMom          -0.08553    0.06030  -1.418 0.156063    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4031.3  on 2907  degrees of freedom
Residual deviance: 2618.4  on 2890  degrees of freedom
AIC: 2654.4

Number of Fisher Scoring iterations: 7

| Variable            | Coefficient |
|---------------------|-------------|
| racePctWhite        | -1.10047    |
| TotalPctDiv         | 0.57164     |
| pctWWage            | -0.30159    |
| blackPerCap         | -0.17502    |
| PolicOperBudg       | 0.93382     |
| PctPolicMinor       | 0.76118     |
| LemasPctOfficDrugUn | 0.25866     |
| PctForeignBorn      | -0.24932    |
| MedRentPctHousInc   | 0.15239     |
| PolicAveOTWorked    | -0.16298    |
| PersPerFam          | 0.21978    |
| PolicPerPop         | -0.17377    |
| racepctblack        | 0.32598     |
| PctBornSameState    | -0.09319    |
| PctPolicWhite       | 0.26132     |
| indianPerCap        | -0.10508    |
| PctWorkMom          | -0.08553    |

The table of coefficients shows that the PolicOperBudg variable has the highest positive coefficient (0.93). This means that, according to the model, a high police operating budget is associated with a higher probability of the community having an above-average murder rate. Similarly, communities with a high PctPolicMinor also have a higher probability of a high murder rate.

On the contrary, the coefficient of racePctWhite (the largest in absolute terms) is negative. That is, a high percentage of Caucasian residents is associated with a lower probability of having a murder rate above the national average.

In total, only one variable has a coefficient greater than 1.0 in absolute value: racePctWhite.
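Since logistic coefficients are on the log-odds scale, they become easier to read when exponentiated into odds ratios. A minimal sketch of this step, using the built-in mtcars data in place of data4 (the crime data is not reproduced here):

```r
# Sketch: fit a logistic model and convert coefficients to odds ratios.
# mtcars stands in for data4; the binary `am` plays the role of `murder`.
fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
odds_ratios <- exp(coef(fit))

# A coefficient of -1.10047 (as for racePctWhite above) corresponds to an
# odds ratio of exp(-1.10047) ~ 0.333, i.e. roughly a two-thirds reduction
# in the odds of an above-median murder rate per one-unit increase.
round(exp(-1.10047), 3)  # 0.333
```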

Decision Tree

In order to also offer an easily interpretable analysis to decision-makers, a decision tree will be fitted. It will also serve to show which variables the tree regards as important and how they overlap with the ones above. When building the tree, its parameters are chosen so that every leaf must contain at least 10 observations (minbucket = 10), which prevents very small splits at node level and thus a tree that is too deep and overfits. Furthermore, the Gini impurity criterion is used to select the most effective splits, i.e. those that most reduce the impurity of the resulting child nodes.

  # Simple classification tree
  library(rpart)
  library(rpart.plot)
  set.seed(123)
  arbol1 <- rpart(murder ~ ., data = data3,
                  minbucket = 10, method = "class",
                  parms = list(split = "gini"), cp = 0)

  rpart.plot(arbol1, extra = 105, nn = TRUE, tweak = 1.2)

The variables selected by the decision tree are racePctWhite, blackPerCap, TotalPctDiv, MedRentPctHousInc, pctWWage, indianPerCap, PctWorkMom, PctForeignBorn, PctBornSameState, PersPerFam and PctPolicMinor. All of them also appear in the selection made by AIC. Interestingly, only one police-related variable was chosen by the decision tree, as opposed to six by AIC. This can be considered advantageous, recalling the high number of NAs present in the variables of that group.
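The list of variables actually used in splits can be read directly from a fitted rpart object's frame component. A short sketch with the kyphosis data that ships with rpart, standing in for data3:

```r
library(rpart)
set.seed(123)
# kyphosis (bundled with rpart) stands in for the crime data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", minbucket = 10,
             parms = list(split = "gini"), cp = 0)

# Variables that appear in at least one split of the tree
vars_used <- unique(as.character(fit$frame$var))
vars_used <- vars_used[vars_used != "<leaf>"]
```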

The best models created through this analysis are:

| Model          | AUC    | Error Rate |
|----------------|--------|------------|
| SVM RBF        | 0.9383 | 0.1226     |
| GBM            | 0.9297 | 0.1218     |
| Neural Network | 0.9208 | 0.1385     |
| Predi11        | 0.9530 | 0.0976     |
| Predi15        | 0.9514 | 0.0947     |

Looking at this final table, which compares the AUC and Error Rate of the best ensemble models with the simpler models developed throughout this work, the best model by AUC is predi11, a combination of the svmRadial and GBM models, while predi15 has the lowest Error Rate. Nevertheless, after the hypothesis testing we can conclude that it is better to stay with predi15, since the difference in Error Rate between the two models proved larger and more significant than the difference in AUC.
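For reference, both metrics in the table can be computed directly from a model's predicted probabilities. A minimal sketch with toy vectors (not the real model predictions), using the rank-based Mann-Whitney formulation of AUC:

```r
# Toy predicted probabilities and true classes (1 = above-median murder rate)
prob  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
truth <- c(1, 1, 0, 1, 0, 0)

# Error Rate: share of misclassified observations at a 0.5 cutoff
pred <- as.integer(prob > 0.5)
error_rate <- mean(pred != truth)  # 0.3333

# AUC via the rank-based (Mann-Whitney) formula, equivalent to the ROC area
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)  # 0.8889
```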